I have a text file of about 4 GB (5-6 million lines) that I parse and save into a database. Parsing and saving takes almost 3-4 hours, and this process runs every day.

Querying the database is also too slow: even a simple avg or sum over a particular day takes 30-40 minutes to return.

I am using Python and MySQL right now. I also tried Spark for this computation, but that also takes 30-40 minutes. The data is growing, so the file will soon be around 10 GB, and Spark is not able to handle large files.

Please suggest how I can reduce the time spent parsing the file, storing the data in the database, and querying it.

A J
  • Have you tried Hadoop or a similar map-reduce approach to parallelize the processing of your data? – Sean Azlin Mar 27 '15 at 07:20
  • I used Spark because some online discussions suggested it is better than Hadoop: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html I am not sure about Hadoop or map-reduce. – A J Mar 27 '15 at 07:26
  • You could use https://github.com/databricks/spark-csv to load your CSV file. If you use the DataFrame API, you should see a decent performance improvement over the normal API with Python. – Sietse Mar 27 '15 at 10:39
  • _"spark is not able to handle large files"_ That's false. Spark is specifically a solution for dealing with big data. You just linked to a post where it sorted 1 PB. – Daniel Darabos Mar 27 '15 at 12:20

1 Answer

I do not know what database you are using, but maybe you could switch?

I suggest using Impala with an Avro schema. You will probably need to create and refresh the tables using Hive, as Impala lacks some administrative functionality.

I've used it with the files stored on HDFS: grouping and then summing 45 GB of floats took me about 40 seconds on 4 machines. You spend no time loading anything into a database, because the files themselves are the source. The only time you need is for storing the files in HDFS, and that is about as fast as writing to any other file system.
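To illustrate the idea of aggregating straight from the files instead of loading them into a database first, here is a minimal single-machine sketch in Python. It is not Impala and says nothing about Impala's performance. It assumes a comma-separated file whose first column is the day and whose second column is a numeric value; that layout and the function name `aggregate_by_day` are hypothetical, since the question does not show the real file format.

```python
import csv
from collections import defaultdict


def aggregate_by_day(path):
    """Stream a delimited file once and return {day: (total, mean)}.

    Assumed layout (hypothetical): column 0 is the day, column 1 is a number.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or truncated lines
            day, raw_value = row[0], row[1]
            try:
                value = float(raw_value)
            except ValueError:
                continue  # skip a header line or non-numeric values
            totals[day] += value
            counts[day] += 1
    # Only days with at least one valid value are present, so no division by zero.
    return {day: (totals[day], totals[day] / counts[day]) for day in totals}
```

The file is read line by line, so memory use depends on the number of distinct days rather than on the file size. Calling `aggregate_by_day("data.csv")` (the path is a placeholder) returns the sum and the average per day in a single pass.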

szefuf