1

I have a very large CSV dataset (900M records) that consists of the following format:

URL | IP | ActivityId

Sample data:

http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 1

For this format, I wish to get all the unique activities per URL, that do not appear in other URLs.

For example, let's add one more sample to the data I provided above

http://yahoo.com/ | 123.4.5.1 | 2

Now ActivityId 2 is totally excluded because It belongs to two urls: Google and Yahoo. So what I want is to find all the activities that belong to a single URL only, and I wish to know the URL they belong to.

What I tried to do:

Create a dictionary

URL => set(activity1, activity2, ... , activityN)

(This part is slow, and was answered here Parse a very large CSV dataset )

With this dictionary, I compared each entry to eachother and found the difference between the sets and updated the corresponding set with the difference result.

How can I accomplish what I want using pandas?

Community
  • 1
  • 1
kjanko
  • 552
  • 1
  • 5
  • 24

3 Answers3

3

Another solution with nunique:

df.groupby('ActivityId')['URL'].filter(lambda x: x.nunique() == 1)

And faster solution with transform and boolean indexing:

df[df.groupby('ActivityId')['URL'].transform('nunique') == 1]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2
In [21]: df.groupby('ActivityId')['URL'].filter(lambda x: len(x.unique()) == 1)
Out[21]:
1    http://google.com/
Name: URL, dtype: object
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
0

Here is a solution in case you also want to know the ActivityID that belong to a single URL:

Imagine you have the following dataset:

    URL                 IP          ActivityId
0   http://google.com/  127.0.0.1   2
1   http://google.com/  12.3.3.1    1
2   http://yahoo.com/   123.4.5.1   2
3   http://yahoo.com/   123.4.5.1   5

You can do:

In[1]:groups = df.groupby('ActivityId')['URL']
      for name, group in groups:
          if group.size == 1:
              print (name,group.values[0])

Out[2]:1 http://google.com/ 
       5 http://yahoo.com/ 

Meanwhile using:

df.groupby('ActivityId')['URL'].filter(lambda x: len(x.unique()) == 1)

Yields:

1    http://google.com/ 
3     http://yahoo.com/ 
Name: URL, dtype: object
VictorGGl
  • 1,848
  • 10
  • 15