3

It is difficult to describe this on a heading but given these two DataFrames:

import pandas as pd
import numpy as np
import re


df1 = pd.DataFrame({
'url': [
  'http://google.com/car', 
  'http://google.com/moto', 
  'http://google.com/moto-bike'
], 'value': [3, 4, 6]})

url                           value
http://google.com/car         3
http://google.com/moto        4
http://google.com/moto-bike   6

df2 = pd.DataFrame({'name': ['car','moto','bus']})

  name
0 car
1 moto
2 bus

I want to see how many times the name on df2 appears on the url for df1, and have sort of managed with:

df2['instances'] = pd.Series([df1.url.str.contains(fr'\D{w}\D', regex=True) \
.sum() for w in df2.name.tolist()])

For some reason car has zero instances cause there is only one.

   name  instances
0   car          0
1  moto          2
2   bus          0

What I would like to be able to do is to have another column that sums the value column of all matches of df1, so it looks like this:

   name  instances  value_total
0   car          1           3
1  moto          2          10
2   bus          0           0

Any help on the right direction would be greatly appreciated, thanks!

Álvaro
  • 2,255
  • 1
  • 22
  • 48
  • Will all of the urls be from the same domain, or at least of the same format? If so, you could use regex over the url columns to strip out the portion you want, and then it would be pretty easy to compare between the two dataframes with pandas built-ins. – whege Jan 17 '20 at 16:48
  • Yes, same domain but not necessarily the same path, the regex I have seems to be good enough to identify the wanted row – Álvaro Jan 17 '20 at 16:49
  • why do you do `\D` before and after the words? I understand the meaning in regex, just wondering about the reason in the url finding? – Ben.T Jan 17 '20 at 16:52
  • I want to remove any digit or character before and after the `name` value that could be present on the url so \car or -car matches car but carousel or motocar do not – Álvaro Jan 17 '20 at 16:56
  • I think if that is what you want, then you need a repetition qualifier. Either * (0 or more) or ? (0 or 1) would work depending on how many special characters there could be. – whege Jan 17 '20 at 16:57

2 Answers2

3

try with str.extract then merge and groupby with named aggregation (new in pandas 0.25+):

pat = '|'.join(df2['name']) #'car|moto|bus'
m = df2.merge(df1.assign(name=df1['url']
            .str.extract('('+ pat + ')', expand=False)),on='name',how='left')
m = m.groupby('name',sort=False).agg(instances=('value','count')
                 ,value_total=('value','sum')).reset_index()

print(m)

   name  instances  value_total
0   car          1          3.0
1  moto          2         10.0
2   bus          0          0.0
anky
  • 74,114
  • 11
  • 41
  • 70
  • I am getting `TypeError: aggregate() missing 1 required positional argument: 'arg'` – Álvaro Jan 17 '20 at 17:01
  • 1
    @Álvaro named aggregation is introduced in pandas version `0.25` , is it possible to [upgrade](https://stackoverflow.com/questions/22840449/how-to-update-pandas-from-anaconda-and-is-it-possible-to-use-eclipse-with-this-l) pandas version? – anky Jan 17 '20 at 17:01
  • I have upgraded, now it works well, many thanks, do you think this will play well with large amounts of data? – Álvaro Jan 17 '20 at 17:07
  • @Álvaro Any string operation is considerably slow than operating on numerics, however I believe this would be faster since there isnt any loops involved(all of the methods are quite vectorized) – anky Jan 17 '20 at 17:09
  • there is a little problem, with the pattern if I add carousel it will count it as car it should not count the `name` if is not `/`,'-`,or `_` only, that is why I did the regex \D{w}\D – Álvaro Jan 17 '20 at 17:15
  • @Álvaro I am not very good with `regex` however the way of achieving this will still be the same, try playing around with str.extract with your regex , make sure to have a patten with `()` , hope it works. – anky Jan 17 '20 at 17:17
  • 1
    nice as always, you can use `f-strings` instead of string concatenation =) – Umar.H Jan 17 '20 at 17:49
  • 1
    @Datanovice As shown in your answer. yes we can do that :) Thanks :) – anky Jan 17 '20 at 17:51
2

here's a similair version of anky's answer using .loc, groupby & merge

pat = '|'.join(df2['name'])
df1.loc[df1['url'].str.contains(f'({pat})'),'name'] = df1['url'].str.extract(f'({pat})')[0]

vals = (
    df1.groupby("name")
    .agg({"name": "count", "value": "sum"})
    .rename(columns={"name": "instance"})
    .reset_index()
)

new_df = pd.merge(df2,vals,on='name',how='left').fillna(0)

print(new_df)
   name  instance  value
0   car       1.0    3.0
1  moto       2.0   10.0
2   bus       0.0    0.0

edit, if you need an extact match of car then we can add word boundaries:

pat = r'|'.join(np.where(df2['name'].str.contains('car'),
                     r'\b' + df2['name'] + r'\b', df2['name']))
print(df1)
                          url  value 
0       http://google.com/car      3   
1     http://google.com/motor      4  
2  http://google.com/carousel      6  
3       http://google.com/bus      8 

df1.loc[df1['url'].str.contains(f'{pat}'),'name'] = df1['url'].str.extract(f'({pat})')[0]
print(df1)
                          url  value  name
0       http://google.com/car      3   car
1     http://google.com/motor      4  moto
2  http://google.com/carousel      6   NaN
3       http://google.com/bus      8   bus

if you want exact matches for all then just add word boundries to pattern :

pat = '|'.join(r'\b' + df2['name'] + r'\b')
#'\\bcar\\b|\\bmoto\\b|\\bbus\\b'
Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • again with this one if the url says http://google.com/carousel it would count it as car but is not, that is why I did the regex \D{w}\D on str.contains – Álvaro Jan 17 '20 at 17:18
  • @Álvaro you can add a word boundry, but then `motor` would be exluded, is that okay? – Umar.H Jan 17 '20 at 17:42
  • It works but you manually added the word car, how to make it dynamic?, need exact match on all the strings on `df2` not just the word car – Álvaro Jan 17 '20 at 17:51
  • @Álvaro we can make it dynamic for _all_ the variables but as I said above, it would exclude motor from the lookup, hence my `where` statement, does that make sense? – Umar.H Jan 17 '20 at 17:53
  • Oh sorry, no the `motor` was wrong I have fixed the question I meant `moto` sorry – Álvaro Jan 17 '20 at 17:54
  • 1
    @Álvaro fixed for you - Anky_91's answer is better than mine so leave that as the correct one - the only diff is the minor regex – Umar.H Jan 17 '20 at 17:57