Efficient way to get frequency distribution of values in a large MySql table

Question

I have two tables viz. Total_Data and Distinct_S1.

Total_Data has 3.5 million rows. Fields: "S1", "S2", "S3", "S4"

Distinct_S1 has 1 million rows. Fields: "S1", "frequency". "S1" of Distinct_S1 consists of all distinct values which occur in "S1" of Total_Data.

Task at hand: "frequency" of Distinct_S1 for S1 = 'xxyy' (suppose) should consist of the number of times 'xxyy' occurs in Total_Data.

I used the following query from within a python script(MySQLdb imported) to accomplish the Task at hand:

cur.execute("update Distinct_S1 set frequency=(select count(*) from Total_Data where S1='%s') where S1='%s'"%(S1_val, S1_val)

The above query works fine but it seems to take a lot of time. Is there any other faster way to achieve the same?

score 1 · Accepted Answer · answered Jul 21 '12 at 18:17

1

You can use the multi-table update syntax with a join to update all rows at once. See the manual for the syntax.

But a much easier way is to also insert all the values in the table directly without first populating it with the distinct values.

INSERT INTO Distinct_S1 (S1, frequency)
SELECT S1, COUNT(*)
FROM Total_Data
GROUP BY S1

answered Jul 21 '12 at 18:17

Mark Byers

811,555
193
1,581
1,452

I ran your above query. After 13 minutes, it's done! Thanks a lot! P.S. I guess updates are more expensive than inserts and 'python-to-mysql-to-python transition' might have taken a lot of time. – yang5 Jul 21 '12 at 18:49

Efficient way to get frequency distribution of values in a large MySql table

1 Answers1