I'm currently working on a data tracking system. It's a multiprocess application written in Python that works as follows:
1. Every S seconds it selects the N most appropriate tasks from the database (currently Postgres) and fetches data for them.
2. If there are no eligible tasks, it creates N new tasks and returns to step 1 (see the sketch below).
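
In simplified form, each worker runs something like this (a rough sketch only; the psycopg2 usage and the create_new_tasks / fetch_data_for helpers are stand-ins for my real code, and the SELECT is the exact query quoted further down):

import time
import psycopg2

S, N = 60, 36  # polling interval and batch size (illustrative values)

SELECT_TASKS = """
    SELECT * FROM task
    WHERE line = 1 AND action = ANY(ARRAY['Find', 'Get'])
      AND (stat IN ('', 'CR1')
           OR stat = 'ERROR' AND (actiondate <= NOW() OR actiondate IS NULL))
    ORDER BY taskid, actiondate, action DESC, idtype, date ASC
    LIMIT %s
"""

def create_new_tasks(conn, n):
    ...  # stand-in for my task generator

def fetch_data_for(tasks):
    ...  # stand-in for the actual data collection

def worker(dsn):
    conn = psycopg2.connect(dsn)
    while True:
        # the transaction commits (and the lock is released) when the block exits
        with conn, conn.cursor() as cur:
            cur.execute("LOCK TABLE task IN ACCESS EXCLUSIVE MODE")
            cur.execute(SELECT_TASKS, (N,))
            tasks = cur.fetchall()
        if not tasks:
            create_new_tasks(conn, N)  # no eligible tasks: create N new ones, retry
            continue
        fetch_data_for(tasks)
        time.sleep(S)
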
The problem is the following: I currently have approx. 80 GB of data and 36M tasks, and queries against the task table are getting slower and slower (it's the most populated and the most frequently used table).
The main performance bottleneck is the task tracking query:
LOCK TABLE task IN ACCESS EXCLUSIVE MODE;

SELECT *
FROM task
WHERE line = 1
  AND action = ANY(ARRAY['Find', 'Get'])
  AND (stat IN ('', 'CR1')
       OR stat = 'ERROR' AND (actiondate <= NOW() OR actiondate IS NULL))
ORDER BY taskid, actiondate, action DESC, idtype, date ASC
LIMIT 36;
Table "public.task"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------------------
number | character varying(16) | not null
date | timestamp without time zone | default now()
stat | character varying(16) | not null default ''::character varying
idtype | character varying(16) | not null default 'container'::character varying
uri | character varying(1024) |
action | character varying(16) | not null default 'Find'::character varying
reason | character varying(4096) | not null default ''::character varying
rev | integer | not null default 0
actiondate | timestamp without time zone |
modifydate | timestamp without time zone |
line | integer |
datasource | character varying(512) |
taskid | character varying(32) |
found | integer | not null default 0
Indexes:
"task_pkey" PRIMARY KEY, btree (idtype, number)
"action_index" btree (action)
"actiondate_index" btree (actiondate)
"date_index" btree (date)
"line_index" btree (line)
"modifydate_index" btree (modifydate)
"stat_index" btree (stat)
"taskid_index" btree (taskid)
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=312638.87..312638.96 rows=36 width=668) (actual time=1838.193..1838.197 rows=36 loops=1)
-> Sort (cost=312638.87..313149.54 rows=204267 width=668) (actual time=1838.192..1838.194 rows=36 loops=1)
Sort Key: taskid, actiondate, action, idtype, date
Sort Method: top-N heapsort Memory: 43kB
-> Bitmap Heap Scan on task (cost=107497.61..306337.31 rows=204267 width=668) (actual time=1013.491..1343.751 rows=914586 loops=1)
Recheck Cond: ((((stat)::text = ANY ('{"",CR1}'::text[])) OR ((stat)::text = 'ERROR'::text)) AND (line = 1))
Filter: (((action)::text = ANY ('{Find,Get}'::text[])) AND (((stat)::text = ANY ('{"",CR1}'::text[])) OR (((stat)::text = 'ERROR'::text) AND ((actiondate <= now()) OR (actiondate IS NULL)))))
Rows Removed by Filter: 133
Heap Blocks: exact=76064
-> BitmapAnd (cost=107497.61..107497.61 rows=237348 width=0) (actual time=999.457..999.457 rows=0 loops=1)
-> BitmapOr (cost=9949.15..9949.15 rows=964044 width=0) (actual time=121.936..121.936 rows=0 loops=1)
-> Bitmap Index Scan on stat_index (cost=0.00..9449.46 rows=925379 width=0) (actual time=117.791..117.791 rows=920900 loops=1)
Index Cond: ((stat)::text = ANY ('{"",CR1}'::text[]))
-> Bitmap Index Scan on stat_index (cost=0.00..397.55 rows=38665 width=0) (actual time=4.144..4.144 rows=30262 loops=1)
Index Cond: ((stat)::text = 'ERROR'::text)
-> Bitmap Index Scan on line_index (cost=0.00..97497.14 rows=9519277 width=0) (actual time=853.033..853.033 rows=9605462 loops=1)
Index Cond: (line = 1)
Planning time: 0.284 ms
Execution time: 1838.882 ms
(19 rows)
Of course, all fields involved are indexed. I'm currently thinking in two directions:
- how to optimize the query, and whether that would actually give a lasting performance improvement (currently it takes approx. 10 seconds per query, which is unacceptable for dynamic task tracking)
- where and how it would be more effective to store the task data; maybe I should use another DB for this purpose, such as Cassandra, VoltDB, or another Big Data store?
I think the data should somehow be preordered so that the currently eligible tasks can be fetched as fast as possible.
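
For example (just a sketch of what I mean, not something I've tried; the index name is made up and I haven't checked whether the planner would actually use it), a partial composite index matching the query's predicate and sort order:

import psycopg2

# Columns match the ORDER BY (taskid, actiondate, action DESC, idtype, date);
# the predicate covers the immutable parts of the WHERE clause. The
# "actiondate <= NOW()" condition can't go into the predicate because NOW()
# isn't immutable, so it would remain a query-time filter.
CREATE_QUEUE_INDEX = """
    CREATE INDEX CONCURRENTLY task_queue_index
        ON task (taskid, actiondate, action DESC, idtype, date)
        WHERE line = 1
          AND action IN ('Find', 'Get')
          AND (stat IN ('', 'CR1') OR stat = 'ERROR')
"""

conn = psycopg2.connect("dbname=tasks")  # connection parameters assumed
conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run inside a transaction
with conn.cursor() as cur:
    cur.execute(CREATE_QUEUE_INDEX)

The hope is that the LIMIT 36 could then come straight off an ordered index scan instead of top-N sorting ~900k rows, but I don't know whether that holds up at this scale.
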
Also, please keep in mind that my current volume of 80 GB is most likely a minimum rather than a maximum for such a task.
Thanks in advance!