
I have a performance problem merging two columns into one for a table of 7M rows.

My main goal is:

  • merge open_time and close_time into one 'time' column
  • order by time and account_id DESC
  • check the op_type column for each account_id over the entries ordered by time and account_id:
    • if op_type changed from 0 to 1 or vice versa between the first and second entries, increment the counter.
    • if op_type changed again between the second and third entries, increment the counter again.
    • and so on for all entries of each account_id

account_id, open_time and close_time are indexed.
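To make the intended counter concrete, here is a minimal pure-Python sketch of the rule (the function name and tuple layout are my own, not part of the schema). Note that scanning the entries in ascending or descending time order gives the same number of adjacent changes:

```python
from collections import defaultdict

def count_op_type_changes(rows):
    """rows: iterable of (account_id, op_type, time) tuples.

    Counts, per account, how many times op_type flips between
    consecutive entries once the rows are sorted by account and time.
    """
    counters = defaultdict(int)
    previous = {}  # last seen op_type per account
    for account_id, op_type, _ in sorted(rows, key=lambda r: (r[0], r[2])):
        if account_id in previous and previous[account_id] != op_type:
            counters[account_id] += 1
        previous[account_id] = op_type
    return dict(counters)
```

For the op_type sequence 1, 1, 0, 0, 1 this yields a counter of 2 (one 1→0 flip and one 0→1 flip).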

Option #1: Merge open_time and close_time into one column 'time' using a UNION of two SELECT statements:

select account_id, op_type, open_time as time, instrument_id
from tmp_operations
UNION
select account_id, op_type, close_time as time, instrument_id
from tmp_operations

The UNION of the two SELECT statements has been running for over 4,000,000 ms and still hasn't finished.

Option #2: Merge open_time and close_time into one column 'time' by unnesting an array:

SELECT
    account_id,
    op_type,
    unnest(ARRAY[open_time, close_time]) as time,
    instrument_id
FROM risklive.operations_mt4 op

Execution time of the array unnesting is about 315,000 ms, which is much better. Thanks, Gabriel's Messanger!
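For reference, the row-level effect of `unnest(ARRAY[open_time, close_time])` here is simply to emit two output rows per input row, one per timestamp, with the other columns duplicated. A small Python sketch of that semantics (the function name is mine, for illustration only):

```python
def merge_times(rows):
    """rows: iterable of (account_id, op_type, open_time, close_time, instrument_id).

    Emits two rows per input row, one for open_time and one for
    close_time, mirroring unnest(ARRAY[open_time, close_time]).
    """
    out = []
    for account_id, op_type, open_time, close_time, instrument_id in rows:
        out.append((account_id, op_type, open_time, instrument_id))
        out.append((account_id, op_type, close_time, instrument_id))
    return out
```

So a 7M-row table produces a 14M-row "merged" set, which is why every downstream sort and aggregation is expensive.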

A sample of what I want to see as the result of the timestamp merge:

      open_time               close_time                       time
"2015-08-19 09:18:24"    "2015-08-19 09:20:40"          "2015-08-19 09:18:24" 
"2015-08-19 09:11:54"    "2015-08-19 09:17:16"    -->   "2015-08-19 09:20:40"
"2015-08-19 09:17:46"    "2015-08-19 09:18:22"          "2015-08-19 09:11:54"
                                                        "2015-08-19 09:17:16"
                                                        "2015-08-19 09:17:16"
                                                        "2015-08-19 09:17:46"
                                                        "2015-08-19 09:18:22"

As for the op_type change counter over each account_id's entries:

account_id   op_type         time
  63004;        1;    "2015-08-19 09:18:24"
  63004;        1;    "2015-08-19 09:20:40"
  63004;        1;    "2015-08-19 09:11:54"
  63004;        1;    "2015-08-19 09:17:16"   <-- op_type will be changed in next entry
  63004;        0;    "2015-08-19 09:17:46"   <-- counter = 1
  63004;        0;    "2015-08-19 09:18:22"   <-- op_type will be changed in next entry
  63004;        1;    "2015-08-19 09:09:31"   <-- counter = 2
  63004;        1;    "2015-08-19 09:09:31"
  63004;        1;    "2015-08-19 09:31:09"
  63004;        1;    "2015-08-19 09:32:07"   <-- op_type will be changed in next entry
  63004;        0;    "2015-08-19 09:32:09"   <-- counter = 3
  63004;        0;    "2015-08-19 09:57:44"   <-- op_type will be changed in next entry
  63004;        1;    "2015-08-19 09:20:43"   <-- counter = 4
  63004;        1;    "2015-08-19 09:31:51"
  63004;        1;    "2015-08-19 09:20:59"
  63004;        1;    "2015-08-19 09:31:51"

I don't yet know how to implement the op_type change counter shown above.

How can I tune all of this?

Viktor M.
  • It's unclear to me what you are trying to achieve. The result will only have a _single_ column named `open_time`, it won't have a column named `close_time` because the column names of the second query are irrelevant. Please **edit** your question and add some sample data and the expected output based on that data. –  Nov 27 '15 at 12:17
  • _Not tested and I don't know whether this will work for you, just try_ `coalesce(nullif(0,op_type),1)` instead of `(CASE WHEN op_type=0 THEN 1 ELSE 0 END)` – Vivek S. Nov 27 '15 at 12:23
  • @wingedpanther I doubt that `COALESCE` is his biggest problem in terms of performance, but +1 for introducing the `nullif` function. It's a handy little devil :) – Gabriel's Messanger Nov 27 '15 at 12:46
  • @user1376885 oh, you made your question more precise. This changed the problem in terms of `op_type`. Let me think... – Gabriel's Messanger Nov 27 '15 at 13:55
  • Can [open_time, close_time] ranges for one `account_id` overlap? e.g. `1, '2015-08-15 19:00', '2015-08-15 19:30'` and `2, '2015-08-15 19:10', '2015-08-15 19:40'`? – Gabriel's Messanger Nov 27 '15 at 14:23
  • Only if these entries go in sequence and there are no changes of op_type between those timestamps. – Viktor M. Nov 27 '15 at 14:33
  • One more question. Do you need to know "when" the counter changes, or only the last value for every `account_id`? As in your example: do you need to know that for `account_id=63004` `counter=4`, or to know that at row `x` `counter=1` and at row `y` `counter=2`, etc.? – Gabriel's Messanger Nov 27 '15 at 14:46
  • I only need to know how many op_type changes were detected for each account based on sorting by account, time - `account_id=63004 counter=4` – Viktor M. Nov 27 '15 at 14:50

1 Answer


Since you've changed your question a little bit after my answer, I've completely rewritten my post too.

Disclaimer:

You need to perform an operation that requires merging and ordering the whole table (7M rows); that is always a bottleneck. It's possible that you won't find a solution which satisfies you without changing your approach entirely. Nonetheless, let me try.

First problem:

So your first problem was to "merge" two columns into one for the whole table of 7M rows. You tried UNION, which needs two sequential scans plus a sort to remove duplicates. As I proposed before, a solution may be to use array aggregation and unnest (which you did):

SELECT
    account_id,
    op_type,
    unnest(ARRAY[open_time, close_time]) as time,
    instrument_id
FROM risklive.operations_mt4 op

Second problem:

It is counting op_type changes per account_id while ordered by the "merged" time column. For readability I use a CTE to hold the "merged table".

We must use a subquery. At one level we check for op_type changes in the proper order (using the lag() window function, which returns the value from the row before the current row). At the second level we sum up the number of op_type changes.

WITH merged_table AS (
    SELECT
        account_id,
        op_type,
        unnest(ARRAY[open_time, close_time]) as time,
        instrument_id
    FROM risklive.operations_mt4 op
)
SELECT
    account_id, SUM(abs(zero_if_no_op_type_change)) as counter
FROM (
    SELECT
         m.account_id,
         (m.op_type - lag(m.op_type)
                 OVER (PARTITION BY m.account_id ORDER BY time)
         ) as zero_if_no_op_type_change
    FROM merged_table m
) sub
GROUP BY account_id

Unfortunately, it may still take too long for your needs. If so, it's hard to do much more improvement, in my opinion.

Gabriel's Messanger
  • Thanks Gabriel, it helped me to merge the timestamps with an execution time of 315,000 ms! =) That's much better than the UNION. The last place where I'm stuck is the op_type change counter. Do you have any ideas how to do it without duplicating the op_type column with reversed values and shifting that new column so one op_type column can be compared with the other? – Viktor M. Nov 27 '15 at 13:43
  • You really changed your question. To be honest, what you did previously with `CASE WHEN` and `op_type` didn't do what you now expect, which made my answer not perfectly good. I'll try to rewrite my answer to meet your criteria. – Gabriel's Messanger Nov 27 '15 at 13:59
  • Thanks Gabriel, some tips from your answer really helped me to improve performance! – Viktor M. Nov 30 '15 at 11:39
  • One more question: what index algorithm is preferable for a huge amount of data if we use this table (~14M rows) in joins in different SQL statements? One table (~10M rows) joins with this table (~14M rows). – Viktor M. Dec 02 '15 at 17:07
  • The choice of index algorithm doesn't depend on table size, but on the column type and the operations the index is supposed to support. For simple comparisons (types like `int`, `text`, `numeric` etc. with operators like =, <, >, `BETWEEN` etc.) a b-tree is really sufficient. Honestly, a b-tree is sufficient most of the time. The other types (GIN, GiST) are useful, for example, for array columns or `range` types. Basically, if there's something you can't do with a b-tree, then check out the other types. Also [read the manual](http://www.postgresql.org/docs/9.4/static/indexes-types.html) – Gabriel's Messanger Dec 02 '15 at 18:47
  • Thank you Gabriel again! – Viktor M. Dec 03 '15 at 08:46