
I have a pretty huge data set of products, users, and their purchase dates.

There are around 100 million rows, taking around 10 GB of disk space.

The dataset has the following layout:

userid     itemid      purchase_date    
1             1          2018-12-22
11            1          2018-12-22
11            4          2018-12-22
12            4          2018-12-22
11            5          2018-12-22

.......100M+ rows.....

I also added these indexes:

ALTER TABLE purchase_data ADD INDEX (userid);
ALTER TABLE purchase_data ADD INDEX (itemid);
ALTER TABLE purchase_data ADD INDEX (purchase_date);
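
To check whether MySQL actually uses these indexes for a given query, you can prefix the query with `EXPLAIN` (a sketch, using the table and columns from the question):

```sql
-- If the `key` column in the EXPLAIN output is NULL for a table,
-- no index is being used for that table access.
EXPLAIN
SELECT itemid
FROM purchase_data
WHERE userid IN (SELECT userid FROM purchase_data WHERE itemid = 1)
  AND itemid != 1;
```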

Let's say I want to find all the users who purchased a product (item 1), and then find all the other items they purchased.

Select itemid from purchase_data
    where userid in (Select userid from purchase_data
                     where itemid=1)
      and itemid!=1

This query takes forever to run.

Secondly, I also want to count the common items among those users. For example, userid 11 bought item 4 and userid 12 also bought item 4, so item 4 should appear with count 2.

I wrote a similar query for that:

Select itemid, count(*) from purchase_data
    where userid in (Select userid from purchase_data
                     where itemid=1)
      and itemid!=1
    group by itemid
    having count(itemid)>=1

This query also takes forever.

Please help,

Thank You

  • Can you post also the EXPLAIN results for those queries? BTW, the first one has a typo, you have a "," after "user_id" in the subquery). – Andrei Kovacs Jan 20 '19 at 14:58
  • Did you confirm the database really used the index? `EXPLAIN `.. Making a index does mean the database should use it.. Besides calculate the index selectivity to see if the index makes sense `Select COUNT(DISTINCT itemid)/COUNT(*) AS index_selectivity from purchase_data`, ideally it should be a value close to 1 – Raymond Nijland Jan 20 '19 at 15:04
  • There are syntax errors. Did you leave out something critical? – Rick James Jan 20 '19 at 22:50

1 Answer


Instead of an IN clause you should use an inner join, e.g.:

Select a.itemid
from purchase_data a
INNER JOIN (
    Select userid
    from purchase_data where itemid = 1
) t on t.userid = a.userid
where a.itemid != 1

An IN clause works as several OR conditions, while the inner join works as a single relation.
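On recent MySQL versions a correlated `EXISTS` is another way to express the same filter; a sketch, assuming the same table (the hypothetical aliases `p` and `q` are mine):

```sql
-- Semi-join via EXISTS: keep a purchase row if the same user
-- also bought item 1. Unlike the JOIN version, this cannot
-- duplicate rows when a user bought item 1 more than once.
SELECT p.itemid
FROM purchase_data p
WHERE p.itemid != 1
  AND EXISTS (SELECT 1
              FROM purchase_data q
              WHERE q.userid = p.userid
                AND q.itemid = 1);
```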

And instead of several single-column indexes, you should drop those indexes and create a composite index, with the column involved in the join condition on the left and the other columns on the right:

create index my_idx on purchase_data (userid, itemid);
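
Since the inner query filters on `itemid`, a second composite index leading with `itemid` may help that side of the join (a sketch; the name `my_idx2` is mine, and whether to also keep the single-column indexes depends on your other queries):

```sql
-- Serves `WHERE itemid = 1` and returns userid straight from the
-- index entries, so the subquery never touches the base table
-- (a covering index for that subquery).
CREATE INDEX my_idx2 ON purchase_data (itemid, userid);
```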

The same applies to the grouped query:

Select a.itemid, count(*)
from purchase_data a
INNER JOIN (
    Select userid
    from purchase_data where itemid = 1
) t on t.userid = a.userid
where a.itemid != 1
group by a.itemid
having count(a.itemid) >= 1
  • Removing the single indexes and replacing them with this composite index might make things worse here. The RDBMS needs to go over the table twice: once with the predicate `itemid = 1` and once with the predicates `t.user_id=a.user_id` & `a.itemid !=1`. The first one will have absolutely no benefit from your suggested index; the second one might, although it probably will only be because the column is included and avoids a bookmark lookup. `!=1` won't be excluding many rows, and we need to go fetch the itemid anyway (unless that's the PK, in which case (at least in MSSQL) it's implicitly included) – deroby Jan 24 '19 at 11:17