Parse a very large CSV dataset

Question

I have a very large CSV dataset (900M records) that consists of the following format:

URL | IP | ActivityId

Example data:

http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 2

For this format, I wish to get all the unique activities per URL.

What I tried to do was create a dictionary where the key is the URL, and the value is a set of unique activities. However, this fails miserably performance wise - it eats up all the RAM and is very slow time-wise ( O(n) operation )

Is there any other faster approach?

score 0 · Answer 1 · answered May 10 '17 at 07:36

I'm assuming that you have in fact already loaded the dataset into memory (let's say it's stored in a variable named df), and run into trouble when you tried to "get all the unique activities per URL".

If you just want the number of unique activities per URL, use the .groupby() method:

df.groupby('URL').ActivityId.nunique()

If you also want all the ActivityIds, use .drop_duplicates():

df[['URL','ActivityId']].drop_duplicates()

score 0 · Answer 2 · edited May 23 '17 at 10:31

First you have to get clear on your requirements. If you have 900 MB of data; and you intend to push all that data into a single dictionary - then you shouldn't be surprised to find that you will need 900 MB of RAM in the end.

In other words: if you want to process all data in memory at the same time, all data needs to be in memory.

In order to speed up processing: I would go in and sort your CSV input first (on the URL column); for example using awk. Now you can read that file line by line; but you know that all rows with identical URLs will be showing up as "blocks". Meaning: now you don't need to read all URLs in memory; you can process them one after the other instead.

score 0 · Accepted Answer · answered May 10 '17 at 07:40

0

In my opinion you can use dask.

then there is same solution as pandas:

df.groupby('URL')['ActivityId'].nunique()

answered May 10 '17 at 07:40

jezrael

822,522
95
1,334
1,252

What about selecting those activities that are only unique to the given URL, and do not appear in other URLs? – kjanko May 10 '17 at 07:59
It is more complicated, not sure if dask support it. But give me a time, I try find solution. – jezrael May 10 '17 at 08:03
It is really hard, maybe better is post new question with sample and desired output, also some code what you try. Now i have no solution for it :( – jezrael May 10 '17 at 08:25
Will do in 50 min :) – kjanko May 10 '17 at 08:32
What I did was compare each entry from the dic to eachother and update the first entry's set with the difference of the two sets. But sinceI can't use the dic data structure I need to find another way. – kjanko May 10 '17 at 08:33

Parse a very large CSV dataset

3 Answers3

Linked