
We are currently logging impression data for several websites using MySQL and are seeking a more appropriate replacement for logging the high volume of traffic our sites now see.

By "high volume" I mean that we are logging about 50 million entries per month for this impression data. It is important to note that this table activity is almost exclusively write and only rarely read. (Different from this use-case on SO: Which NoSQL database for extremely high volumes of data). We have worked around some of the MySQL performance issues by partitioning the data by range and performing bulk inserts, but in the big picture, we shouldn't be using MySQL.

What we ultimately need in MySQL is the aggregated data, and I believe there are other technologies much better suited to the high-volume logging portion of this use case. I have read about MongoDB, HBase (with MapReduce), Cassandra, and Apache Flume, and I feel like I'm on the right track, but I need some guidance on which technology (or combination) I should be looking at.

What I would like to know specifically is what platforms are best suited for high-volume logging and how to get an aggregated/reduced data set fed into MySQL on a daily basis.

– Nigel Tufnel

2 Answers


Hive doesn't store information; it only allows you to query "raw" data with an SQL-like language (HQL).

If your aggregated data is small enough to be stored in MySQL and that is the only use you have for the data, then HBase could be more than you need.

My suggestion is to use Hadoop (HDFS and MapReduce):

  1. Create log files (text files) with the impression events.
  2. Move them into HDFS (an alternative could be to use Kafka or Storm if you need a near real-time solution).
  3. Create a MapReduce job that reads and aggregates your logs and, in the reduce output, uses a DBOutputFormat to store the aggregated data in MySQL (a sketch follows this list).
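
Here is a minimal sketch of step 3. It assumes tab-separated log lines of the form site<TAB>yyyy-MM-dd<TAB>... and a target MySQL table daily_impressions(site, day, impressions); the log format, table name, and connection details are assumptions for illustration, not your actual schema:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class ImpressionAggregator {

        // One aggregated row: (site, day, impression count).
        public static class DailyCount implements Writable, DBWritable {
            String site;
            String day;
            long count;

            public void write(PreparedStatement st) throws SQLException {
                st.setString(1, site);
                st.setString(2, day);
                st.setLong(3, count);
            }
            public void readFields(ResultSet rs) throws SQLException {
                site = rs.getString(1); day = rs.getString(2); count = rs.getLong(3);
            }
            public void write(DataOutput out) throws IOException {
                out.writeUTF(site); out.writeUTF(day); out.writeLong(count);
            }
            public void readFields(DataInput in) throws IOException {
                site = in.readUTF(); day = in.readUTF(); count = in.readLong();
            }
        }

        // Emits ("site<TAB>day", 1) per log line; assumes the format above.
        public static class ParseMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split("\t");
                ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
            }
        }

        // Sums impressions per (site, day) and emits one row for DBOutputFormat.
        public static class SumReducer
                extends Reducer<Text, LongWritable, DailyCount, NullWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable v : vals) total += v.get();
                String[] k = key.toString().split("\t");
                DailyCount row = new DailyCount();
                row.site = k[0]; row.day = k[1]; row.count = total;
                ctx.write(row, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // JDBC connection details are placeholders.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost:3306/stats", "user", "pass");
            Job job = Job.getInstance(conf, "daily impression aggregation");
            job.setJarByClass(ImpressionAggregator.class);
            job.setMapperClass(ParseMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setOutputKeyClass(DailyCount.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Target table daily_impressions(site, day, impressions) is assumed.
            DBOutputFormat.setOutput(job, "daily_impressions",
                    "site", "day", "impressions");
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Since the aggregation collapses tens of millions of events into a handful of rows per day, DBOutputFormat's row-at-a-time INSERTs into MySQL are not a bottleneck here.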
– RojoSam
  • Yes, the aggregated data would reduce to around 100 records per day, which is very manageable in MySQL. Can you elaborate a bit on #1? We are using PHP in the application, and if I understand #1 correctly, we would just write the impression events to the filesystem. A cron job could run periodically to move this data into HDFS. If we opted for the alternative in #2, would we feed data directly from our application into Kafka or Storm, thus eliminating #1? Thanks for the feedback. – Nigel Tufnel Sep 10 '15 at 20:52
  • To take advantage of Hadoop's capabilities you have to use the HDFS filesystem as storage (a distributed file system with distributed processing). There are different ways to put data into HDFS. For non-real-time solutions, the most common approach is to generate log files locally in the web application or in the backend (your PHP app) and then move them into HDFS; a sketch of this follows these comments. DistCp is a good tool for copying data into and out of HDFS, and it is very easy to use. – RojoSam Sep 11 '15 at 01:05
  • Kafka and Storm are more for receiving events from the source application at runtime and, after a given amount of time or number of events, generating a file inside HDFS (in our case). With them, instead of writing your logs to the local filesystem, you would send the events to a queue (Kafka). There is no way to avoid writing the events to a file and sending them somewhere; the difference is the complexity of the architecture (and perhaps the order of the steps). The easiest solution is to write logs locally and then move the files into HDFS using DistCp. – RojoSam Sep 11 '15 at 01:14
  • I've been reading up on HDFS and it does sound like a viable solution technologically. My main concern is the cost. We are running on Rackspace, and it appears we'll need at least a few cloud servers to dedicate to Hadoop. Rackspace has a "big data" offering, but it is quite pricey for our needs. What would be the minimum setup you'd recommend for a Hadoop deployment? Thanks, this has been really helpful information. – Nigel Tufnel Sep 11 '15 at 03:29
  • It all depends on your data volume and the complexity of your aggregation, but I would guess you'd be fine with an on-demand Hadoop cluster like EMR (an Amazon service), where you only pay for the time you actually use the cluster; running a small job daily shouldn't be expensive. The catch is that you won't have direct access to their filesystem and they won't have direct access to your DB, so you would use their storage (S3, in Amazon's case), transfer the output to your system, and then upload the result into your MySQL. – RojoSam Sep 11 '15 at 03:47
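
To make the "write logs locally, then move them into HDFS" flow from the comments above concrete, here is a minimal sketch using the Hadoop FileSystem API; the NameNode address and paths are placeholders, and running hadoop distcp or hdfs dfs -put from a cron job accomplishes the same thing:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogShipper {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode URI is a placeholder for your cluster's address.
            FileSystem hdfs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), conf);
            // Ship yesterday's rotated log file (local path is hypothetical).
            hdfs.copyFromLocalFile(
                    new Path("/var/log/impressions/impressions-2015-09-10.log"),
                    new Path("/data/impressions/2015-09-10/"));
            hdfs.close();
        }
    }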

One approach could be to simply dump the raw impression log into flat files. A daily batch would then process these files using a MapReduce program, and the aggregated MapReduce output could be stored in Hive or HBase.
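
If the aggregated output went to HBase rather than MySQL, the write path could look roughly like this sketch (the table name, column family, and row-key scheme are assumptions for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AggregateToHBase {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("daily_impressions"))) {
                // Row key "site#day"; one cell holding the daily count.
                Put put = new Put(Bytes.toBytes("site-a#2015-09-10"));
                put.addColumn(Bytes.toBytes("agg"), Bytes.toBytes("impressions"),
                        Bytes.toBytes(123456L));
                table.put(put);
            }
        }
    }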

Please let me know if you see any problems with this approach. The big-data technology stack offers many options depending on the type of data and the way it needs to be aggregated.

– Gyanendra Dwivedi
  • The daily batch is something I had in mind, using MapReduce (or a comparable technology) on the data set to feed *into* MySQL. However, the core problem we want to address is to stop using MySQL for the high-volume impression logging. That is where another storage option comes into play, such as Hive or HBase. I had in mind to store the aggregated data in MySQL, although it could reside in the same data store as the raw logging data. Would you recommend either Hive or HBase for the raw impression logging? – Nigel Tufnel Sep 10 '15 at 13:15
  • For raw impression logging, it depends on whether the data is structured and can be represented in a table format. You could use Hive in that case. – Gyanendra Dwivedi Sep 11 '15 at 05:00