Extract unique data from a large dataset

Question

I have a very large CSV dataset (900M records) that consists of the following format:

URL | IP | ActivityId

Sample data:

http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 1

For this format, I wish to get all the unique activities per URL, that do not appear in other URLs.

For example, let's add one more sample to the data I provided above

http://yahoo.com/ | 123.4.5.1 | 2

Now ActivityId 2 is totally excluded because It belongs to two urls: Google and Yahoo. So what I want is to find all the activities that belong to a single URL only, and I wish to know the URL they belong to.

What I tried to do:

Create a dictionary

URL => set(activity1, activity2, ... , activityN)

(This part is slow, and was answered here Parse a very large CSV dataset )

With this dictionary, I compared each entry to eachother and found the difference between the sets and updated the corresponding set with the difference result.

How can I accomplish what I want using pandas?

score 3 · Accepted Answer · answered May 10 '17 at 10:38

3

Another solution with nunique:

df.groupby('ActivityId')['URL'].filter(lambda x: x.nunique() == 1)

And faster solution with transform and boolean indexing:

df[df.groupby('ActivityId')['URL'].transform('nunique') == 1]

answered May 10 '17 at 10:38

jezrael

822,522
95
1,334
1,252

score 2 · Answer 2 · answered May 10 '17 at 10:33

2

In [21]: df.groupby('ActivityId')['URL'].filter(lambda x: len(x.unique()) == 1)
Out[21]:
1    http://google.com/
Name: URL, dtype: object

answered May 10 '17 at 10:33

MaxU - stand with Ukraine

205,989
36
386
419

score 0 · Answer 3 · answered May 10 '17 at 11:18

Here is a solution in case you also want to know the ActivityID that belong to a single URL:

Imagine you have the following dataset:

    URL                 IP          ActivityId
0   http://google.com/  127.0.0.1   2
1   http://google.com/  12.3.3.1    1
2   http://yahoo.com/   123.4.5.1   2
3   http://yahoo.com/   123.4.5.1   5

You can do:

In[1]:groups = df.groupby('ActivityId')['URL']
      for name, group in groups:
          if group.size == 1:
              print (name,group.values[0])

Out[2]:1 http://google.com/ 
       5 http://yahoo.com/

Meanwhile using:

df.groupby('ActivityId')['URL'].filter(lambda x: len(x.unique()) == 1)

Yields:

1    http://google.com/ 
3     http://yahoo.com/ 
Name: URL, dtype: object

Extract unique data from a large dataset

3 Answers3