I have a very large CSV dataset (900M records) in the following format:
URL | IP | ActivityId
Sample data:
http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 1
From this data, I want to get, for each URL, the activities that do not appear under any other URL.
For example, let's add one more row to the sample data above:
http://yahoo.com/ | 123.4.5.1 | 2
Now ActivityId 2 is excluded entirely because it belongs to two URLs: Google and Yahoo. So what I want is to find all the activities that belong to exactly one URL, along with the URL each one belongs to.
What I tried to do:
Create a dictionary
URL => set(activity1, activity2, ... , activityN)
(This part is slow; it was addressed in the question "Parse a very large CSV dataset".)
With this dictionary, I compared each entry to every other entry, computed the difference between the sets, and updated the corresponding set with the result.
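For reference, the approach described above looks roughly like the following. This is a minimal sketch, not my exact code: the function name `exclusive_activities` and the in-memory `rows` list are placeholders for illustration, and the real data would be streamed from the CSV file instead. Unlike my description, the sketch collects the differences in a separate dictionary rather than updating the sets in place, so earlier comparisons do not affect later ones.

```python
from collections import defaultdict


def exclusive_activities(rows):
    """Map each URL to the activity IDs that appear under no other URL.

    `rows` is any iterable of (url, ip, activity_id) tuples.
    """
    # Step 1: URL => set(activity1, activity2, ..., activityN)
    url_to_activities = defaultdict(set)
    for url, _ip, activity_id in rows:
        url_to_activities[url].add(activity_id)

    # Step 2: compare every URL against all the others and keep only
    # the activities that no other URL has. This is quadratic in the
    # number of distinct URLs, which is why it gets slow.
    result = {}
    for url, activities in url_to_activities.items():
        seen_elsewhere = set()
        for other_url, other_activities in url_to_activities.items():
            if other_url != url:
                seen_elsewhere |= other_activities
        result[url] = activities - seen_elsewhere
    return result


# The three sample rows from the question (hypothetical in-memory copy).
rows = [
    ("http://google.com/", "127.0.0.1", 2),
    ("http://google.com/", "12.3.3.1", 1),
    ("http://yahoo.com/", "123.4.5.1", 2),
]
exclusive = exclusive_activities(rows)
```

With the sample rows, only ActivityId 1 survives, under the Google URL; the Yahoo URL ends up with an empty set because its only activity is shared.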
How can I accomplish what I want using pandas?