
I'm currently working on a data tracking system. The system is a multiprocess application written in Python that works in the following manner:

  1. every S seconds it selects the N most appropriate tasks from the database (currently Postgres) and finds data for them;
  2. if there are no tasks, it creates N new tasks and returns to step 1.

The problem is the following: I currently have approx. 80 GB of data and 36 M tasks, and queries against the task table keep getting slower and slower (it's the most populated and most frequently used table).

The main performance bottleneck is the task tracking query:

LOCK TABLE task IN ACCESS EXCLUSIVE MODE;
SELECT *
FROM task
WHERE line = 1
  AND action = ANY(ARRAY['Find', 'Get'])
  AND (stat IN ('', 'CR1')
       OR (stat = 'ERROR' AND (actiondate <= NOW() OR actiondate IS NULL)))
ORDER BY taskid, actiondate, action DESC, idtype, date ASC
LIMIT 36;

                                    Table "public.task"
   Column   |            Type             |                    Modifiers
------------+-----------------------------+-------------------------------------------------
 number     | character varying(16)       | not null
 date       | timestamp without time zone | default now()
 stat       | character varying(16)       | not null default ''::character varying
 idtype     | character varying(16)       | not null default 'container'::character varying
 uri        | character varying(1024)     |
 action     | character varying(16)       | not null default 'Find'::character varying
 reason     | character varying(4096)     | not null default ''::character varying
 rev        | integer                     | not null default 0
 actiondate | timestamp without time zone |
 modifydate | timestamp without time zone |
 line       | integer                     |
 datasource | character varying(512)      |
 taskid     | character varying(32)       |
 found      | integer                     | not null default 0
Indexes:
    "task_pkey" PRIMARY KEY, btree (idtype, number)
    "action_index" btree (action)
    "actiondate_index" btree (actiondate)
    "date_index" btree (date)
    "line_index" btree (line)
    "modifydate_index" btree (modifydate)
    "stat_index" btree (stat)
    "taskid_index" btree (taskid)

                               QUERY PLAN                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=312638.87..312638.96 rows=36 width=668) (actual time=1838.193..1838.197 rows=36 loops=1)
   ->  Sort  (cost=312638.87..313149.54 rows=204267 width=668) (actual time=1838.192..1838.194 rows=36 loops=1)
         Sort Key: taskid, actiondate, action, idtype, date
         Sort Method: top-N heapsort  Memory: 43kB
         ->  Bitmap Heap Scan on task  (cost=107497.61..306337.31 rows=204267 width=668) (actual time=1013.491..1343.751 rows=914586 loops=1)
               Recheck Cond: ((((stat)::text = ANY ('{"",CR1}'::text[])) OR ((stat)::text = 'ERROR'::text)) AND (line = 1))
               Filter: (((action)::text = ANY ('{Find,Get}'::text[])) AND (((stat)::text = ANY ('{"",CR1}'::text[])) OR (((stat)::text = 'ERROR'::text) AND ((actiondate <= now()) OR (actiondate IS NULL)))))
               Rows Removed by Filter: 133
               Heap Blocks: exact=76064
               ->  BitmapAnd  (cost=107497.61..107497.61 rows=237348 width=0) (actual time=999.457..999.457 rows=0 loops=1)
                     ->  BitmapOr  (cost=9949.15..9949.15 rows=964044 width=0) (actual time=121.936..121.936 rows=0 loops=1)
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..9449.46 rows=925379 width=0) (actual time=117.791..117.791 rows=920900 loops=1)
                                 Index Cond: ((stat)::text = ANY ('{"",CR1}'::text[]))
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..397.55 rows=38665 width=0) (actual time=4.144..4.144 rows=30262 loops=1)
                                 Index Cond: ((stat)::text = 'ERROR'::text)
                     ->  Bitmap Index Scan on line_index  (cost=0.00..97497.14 rows=9519277 width=0) (actual time=853.033..853.033 rows=9605462 loops=1)
                           Index Cond: (line = 1)
 Planning time: 0.284 ms
 Execution time: 1838.882 ms
(19 rows)

Of course, all involved fields are indexed. I'm currently thinking in two directions:

  1. how to optimize the query, and whether it will actually give a performance improvement in the long run (currently a query takes approx. 10 seconds, which is unacceptable for dynamic task tracking)
  2. where and how it would be more effective to store the task data; maybe I should use another DB for this purpose, such as Cassandra, VoltDB, or another Big Data store?

I think the data should somehow be preordered so that the current tasks can be fetched as quickly as possible.
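For example, I was wondering whether an index that matches the ORDER BY could help here. This is purely a hypothetical sketch (the index name is made up), and I don't know whether the planner would actually use it given the WHERE clause:

-- hypothetical index matching the sort keys of the tracking query
CREATE INDEX task_order_index ON task (taskid, actiondate, action DESC, idtype, date);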

Also, please keep in mind that my current volume of 80 GB is most likely a minimum rather than a maximum for such a task.

Thanks in advance!

Severogor
  • 1) Do you *really* need all these huge varchar fields, can't you move them to a separate table(s)? 2) Especially the primary key, and the low-cardinality {status,type,action} fields, could be compacted to integer types. 3) For *typical* queries like this, partial (*composite*) indexes could be employed. 4) The meaning of the table is not clear; it appears to combine {status,action,logging} kinds of facts. – wildplasser Jun 11 '16 at 13:43
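As a minimal illustration of point 2 above (the lookup table and column names are hypothetical, not part of the existing schema), the low-cardinality varchar columns could be replaced by small integer codes:

-- hypothetical lookup table for the handful of distinct stat values ('', 'CR1', 'ERROR', ...)
CREATE TABLE stat_code (
    stat_id smallint PRIMARY KEY,
    stat    varchar(16) NOT NULL UNIQUE
);

-- task would then store the small integer code instead of the varchar value
ALTER TABLE task ADD COLUMN stat_id smallint REFERENCES stat_code (stat_id);

The same pattern would apply to idtype and action.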

1 Answer


I don't quite understand your use case, but it doesn't look to me like your indexes are working too well. It looks like the query is relying mostly on the stat index. I think you need to look into a composite index, something like (action, line, stat).
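As a rough sketch (the index names are made up and the column order is only a guess, so check it against EXPLAIN ANALYZE on your data):

-- composite index covering the filter columns of the polling query
CREATE INDEX task_action_line_stat_index ON task (action, line, stat);

-- or, as suggested in the comments, a partial variant limited to the rows the poller actually selects
CREATE INDEX task_poll_index ON task (line, action, actiondate)
    WHERE stat IN ('', 'CR1', 'ERROR');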

Another option is to shard your data across multiple tables, splitting it on some key with low cardinality. I don't use Postgres, but I don't think looking at another DB solution is going to work better unless you know exactly what you're optimizing for.
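For illustration, a minimal sketch of that idea using declarative list partitioning (the partition key and table names are hypothetical; this syntax needs PostgreSQL 10 or later, older versions would have to emulate it with inheritance and CHECK constraints):

-- partitioned copy of the task table, split on a low-cardinality key (here: line, purely as an example)
CREATE TABLE task_partitioned (
    LIKE task INCLUDING DEFAULTS INCLUDING CONSTRAINTS
) PARTITION BY LIST (line);

CREATE TABLE task_line_1 PARTITION OF task_partitioned FOR VALUES IN (1);
CREATE TABLE task_line_2 PARTITION OF task_partitioned FOR VALUES IN (2);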

Josh Wilson
  • Could you please explain a bit your last sentence? – Severogor Jun 11 '16 at 08:44
  • The other databases you mentioned are each good at some things and bad at others. Personally, I think Postgres is the easiest to learn how to optimize first, so I would stick with that unless you have a really compelling reason to switch. (A Stack Overflow answer isn't going to give you enough information to know whether you should, unless you already know what to ask.) – Josh Wilson Jun 11 '16 at 17:16