Should I process a large amount of data with SQL or Ruby?

Question

I have a MySQL table with hundreds of thousands of entries in it.

I need to specify a date range and select all the entries between those two dates. I then need to break down the entries hour by hour and obtain a specific field.

Use Case: I would need an hour-to-hour breakdown of the entries from June 6 to June 12. So, on June 6th I would need an array of entries from 12am to 1am, 1am to 2am, 2am to 3am, etc.

Which is faster? (and why!):

Running 144 SQL queries to SELECT device_id FROM entries WHERE updated_at >= sometime AND updated_at <= sometime+1.hour where sometime is something like 12pm to 1pm, and then 1pm to 2pm, etc.
Running 1 SQL query to SELECT device_id FROM entries WHERE updated_at >= start_date AND updated_at <= end_date to get all the entries within the entire time period, and then use ruby to group the entries down by hour.

If anyone has any resources on why ruby or MySQL might be faster for this type of stuff, it would be much appreciated. Would like to do some reading.

Describe "logs". Do you mean individual lines/records, or do you store an entire log file in a character field? It's hard for us to see over your shoulder to know what you mean, so examples really help us understand what you're talking about. Once we know that we can make a better recommendation. Until then we're shooting in the dark. — the Tin Man, Jun 12 '13 at 21:33
Sorry, logs is a bad name cause of the confusion with log files. It's just a table called logs, but might as well be called entries. I'll update my question. — Brian Weinreich, Jun 12 '13 at 22:35
One way to go about this is to truncate your `updated_at` field value to drop the minutes and seconds and do a `group by` on the resulting datetime. That will give you hourly groups. See http://stackoverflow.com/questions/7928537/mysql-select-datetime-similar-up-to-the-minute and http://stackoverflow.com/questions/1921362/how-to-round-a-datetime-in-mysql and http://stackoverflow.com/questions/9922840/mysql-datetime-range-select?rq=1 for ideas. — the Tin Man, Jun 12 '13 at 23:12

score 2 · Answer 1 · answered Jun 12 '13 at 18:45

2

SQL databases are built for such operations. Use them! You can always do some benchmarks but the results seem very predictable to me.

answered Jun 12 '13 at 18:45

Manu Clementz

1,807
12
19

score 1 · Accepted Answer · answered Jun 12 '13 at 19:19

I think the best way to do that is to have a updated_at_month, updated_at_day, updated_at_hour (TINYINT - as all values are very small) INDEXED fields when you insert the log, and then select the range and group by like that:

select device_id FROM logs where updated_at BETWEEN sometime AND some_other_time group by updated_at_month, updated_at_day, updated_at_hour

I assume updated_at is indexed field.

or

select device_id FROM logs where updated_at_month=6 and updated_at_day>=6 and updated_at_day<=12 group by updated_at_day, updated_at_hour

But, how many rows are we talking about in total? And as i noticed you select device_id field from that without any counts...or groups...perhaps you should select device_id, updated_at_day,updated_at_hour, count(*) as times

This led me to the right direction.. I ended up with this SQL call which was exactly what I needed and ALOT faster: `SELECT COUNT(DISTINCT(device_id)) AS device_count, updated_at FROM logs WHERE waypoint_id=1 AND updated_at >= '2013-05-15 13:33:47 -0400' AND updated_at <= '2013-06-16 13:33:45 -0400' GROUP BY DAY(updated_at)` — Brian Weinreich, Jun 16 '13 at 19:01

Should I process a large amount of data with SQL or Ruby?

2 Answers2