
I have a PostgreSQL database containing a large number of rows (~160,000,000). The database is basically a set of network logs. There is a column called time, which is a timestamp. For every unique time there is a log entry for each machine on the network, so there are multiple rows with the same timestamp, i.e.

time    ip             value
2:00    192.168.1.1    5
2:00    192.168.1.2    4
2:00    192.168.1.3    5
3:00    192.168.1.1    3
3:00    192.168.1.2    5
3:00    192.168.1.3    2
etc     etc

Now I have created an index on this table by time, as this is the field I always query against.

i.e. select * from networklogs where time = '2:00' and value = 5

Without the index every query takes approximately 2 minutes, because it has to scan the whole table. I created the index so that my query would first find the subset of rows with a matching time and then filter on another column from there, and thus I hoped the query would be a lot faster.
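For reference, the index was created roughly like this (from memory, as I mention in the comments below):

CREATE INDEX timeindex ON networklogs (time);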

However the queries seem to take longer now...

This is strange, because in pgAdmin the explanation shows that the query will use the time index, and that the time index narrows the data down to ~800,000 rows from the total 160,000,000; the filter in the query then narrows this down to 1,700 rows. Producing this explanation takes 17 ms, yet if I actually run the query it takes 3 minutes.

This should be faster than checking two criteria against all 160,000,000 rows!

I cannot work out why it isn't faster. Even though pgAdmin explains the query as using the time index, does it actually use it when the query is performed?

Does anyone have any ideas or suggestions?

Thanks, James

UPDATE: I have run an EXPLAIN ANALYZE on the following command:

explain analyze select latitude,longitude from networklogs where value = 5 and time = '2:00'

The results are:

"Bitmap Heap Scan on networklogs  (cost=13178.17..1488620.94 rows=848 width=19) (actual time=3507.923..278760.509 rows=33 loops=1)"
"  Recheck Cond: (time = '2:00'::timestamp without time zone)"
"  Rows Removed by Index Recheck: 38302021"
"  Filter: (value = 5)"
"  Rows Removed by Filter: 882873"
"  ->  Bitmap Index Scan on timeindex  (cost=0.00..13177.95 rows=530111 width=0) (actual time=695.771..695.771 rows=882906 loops=1)"
"        Index Cond: (time = '2:00'::timestamp without time zone)"
"Total runtime: 278764.148 ms"
  • Do you have the privilege to back up old data from a certain time point and only keep x amount of recent data? – bonCodigo Dec 03 '12 at 16:03
  • It's actually slower now than before (tried a couple of queries for comparison) :P I'm going to try dbrobins' suggestion. – James Elder Dec 03 '12 at 16:04
  • Additionally, PostgreSQL can use subsets of multicolumn indexes (particularly the leftmost columns). Thus, if you always restrict by `time` and sometimes also by `value`, an index on `(time,value)` would be useful for both queries. – willglynn Dec 03 '12 at 16:04
  • Can you please show the `CREATE INDEX` statement you used ? – greg Dec 03 '12 at 16:12
  • @bonCodigo Actually I am not working in real time, so I can't edit the logs. I am trying to create software to analyze the past logs. – James Elder Dec 03 '12 at 16:21
  • @willglynn thanks, I am restricting by different columns at different times; it is only the time column which I always restrict by. – James Elder Dec 03 '12 at 16:22
  • @greg I think the command was `CREATE INDEX timeindex ON networklogs (time)` but this is from memory... In pgAdmin it does show an index on time under the properties of the table. – James Elder Dec 03 '12 at 16:25

2 Answers


If these are mostly the parameters you are going to search against, I would advise creating a multicolumn index on them. In your current situation, the database has an index on time only, so it knows which rows match that condition, but it then has to fetch those rows from all the different places the index points to and only then check the secondary condition.

CREATE INDEX time_and_value ON networklogs (time, value);

Using this index, the database will not have to fetch rows just to check the second condition; it will only fetch rows that it already knows match both search criteria. Column order is important, of course.
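As a sketch of why the order matters: PostgreSQL can use the leftmost columns of a multicolumn index on their own, so this single index should serve both of the following queries:

select * from networklogs where time = '2:00'
select * from networklogs where time = '2:00' and value = 5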

I see you are also working with latitude and longitude data; the point type might be a good fit for that, since it would let you use all the geometric operators that come with Postgres by default. That kind of data can be indexed as well.
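A minimal sketch of that idea, assuming a hypothetical location column (the column name and the bounding box below are made up for illustration):

-- store the coordinates as a single point value: point(x, y)
ALTER TABLE networklogs ADD COLUMN location point;
UPDATE networklogs SET location = point(longitude, latitude);
-- point columns can be indexed with GiST
CREATE INDEX networklogs_location_idx ON networklogs USING gist (location);
-- e.g. find all logs inside a bounding box with the contained-by operator
SELECT * FROM networklogs WHERE location <@ box '((0,50),(10,60))'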

greg
  • Looking at the EXPLAIN output, the `time` index matches 882,906 rows in 0.7 seconds. The database then takes another 27.1 seconds to fetch all those rows (presumably because of [lots of disk seeks](http://stackoverflow.com/a/13236089/1026671)), only to discard 99.996% of them. You definitely need an index on both columns. – willglynn Dec 03 '12 at 16:46
  • I'm sorry, that's 278 seconds (3.16 ms/row), not 27 seconds. Sounds like random reads to me. – willglynn Dec 03 '12 at 16:55
  • Thanks for the answers. I would therefore need to create a few multicolumn indexes; is this possible? My actual database has more columns than the example I showed above, so I would need some indexes like time & value, time & flag, time & noconn... (see the sketch after these comments) – James Elder Dec 03 '12 at 16:55
  • Yes. You can create as many indexes as you want, provided you have enough disk space and that your database server can keep up with the extra work on insert/update/delete operations. – willglynn Dec 03 '12 at 16:56
  • Thanks, I will modify the database for these indexes then. And thanks for the explanation, as I understand how the indexing system works a lot better now. @greg I just wanted to say that for my solution I am not allowed to modify the datatypes; however, I will bear this in mind for the future. – James Elder Dec 03 '12 at 16:59
  • If I could advise you a book (I have no financial interest in it) it would be «SQL Performance Explained» by Markus Winand. There is also an online version available. – greg Dec 03 '12 at 17:02
  • Thanks for the suggestion, I'll have a look at it! – James Elder Dec 03 '12 at 22:15
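For completeness, a sketch of the extra multicolumn indexes discussed in the comments above (flag and noconn are the column names mentioned there; adjust them to the real schema):

CREATE INDEX time_and_flag ON networklogs (time, flag);
CREATE INDEX time_and_noconn ON networklogs (time, noconn);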

PostgreSQL has EXPLAIN and EXPLAIN ANALYZE to examine how a query is run. It is likely that pgAdmin uses EXPLAIN to tell you how the query will run; you may get more accurate results with EXPLAIN ANALYZE, which actually executes the query and reports statistics from the real run. At the least you could see where the time is being spent, even if the query is using the indexes as expected.
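For example, using the query from the question (EXPLAIN only plans the query, while EXPLAIN ANALYZE also executes it and reports actual timings):

explain select latitude,longitude from networklogs where value = 5 and time = '2:00'
explain analyze select latitude,longitude from networklogs where value = 5 and time = '2:00'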

dbrobins
  • I ran the EXPLAIN ANALYZE, but could you please explain to me what this means: `"Bitmap Heap Scan on networklogs (cost=13178.17..1488620.94 rows=848 width=19) (actual time=3507.923..278760.509 rows=33 loops=1)"` in terms of the time taken and the action performed? This is the first line of the output, by the way... – James Elder Dec 03 '12 at 16:27
  • A bitmap heap scan is like a sequential scan, except it only visits the pages the index identified as containing matches, so it can still fetch a lot of rows. As another answer points out, you would benefit from adding `value` to the index. – dbrobins Dec 03 '12 at 21:21