I have a very large CSV dataset (900M records) in the following format:
URL | IP | ActivityId
Example data:
http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 2
Given this data, I want to get all the unique activity IDs per URL.
What I tried was building a dictionary where the key is the URL and the value is a set of unique activity IDs. However, this performs very poorly: it eats up all the RAM and is very slow, even though it is only a single O(n) pass over the data.
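For illustration, the dictionary approach can be sketched roughly as follows. This assumes Python and that fields are separated by `|` as in the example data; the function name `unique_activities_per_url` and the small in-memory sample are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import defaultdict


def unique_activities_per_url(lines):
    """Map each URL to the set of activity IDs seen for it (hypothetical helper)."""
    activities = defaultdict(set)
    # Assumption: the file is pipe-delimited, as in the example rows.
    reader = csv.reader(lines, delimiter="|")
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        url, _ip, activity_id = (field.strip() for field in row)
        activities[url].add(activity_id)
    return activities


# Tiny in-memory sample standing in for the 900M-record file.
sample = io.StringIO(
    "http://google.com/ | 127.0.0.1 | 2\n"
    "http://google.com/ | 12.3.3.1 | 2\n"
)
result = unique_activities_per_url(sample)
print(dict(result))
```

Every distinct URL and its set stay in memory until the pass finishes, so memory use grows with the number of distinct URLs and activity IDs.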
Is there a faster, more memory-efficient approach?