I built a custom function to compare two URLs and get their longest common subsequence (LCS).

def lcs_dynamic(url1, url2):
    # maths: compare url1 with url2 (body omitted here)
    return lcs
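The body of lcs_dynamic is not shown in the question. As a minimal sketch, assuming it returns the length of the longest common subsequence (which is what the later division by len(url2) suggests), a standard dynamic-programming version could look like this:

```python
def lcs_dynamic(url1, url2):
    """Return the length of the longest common subsequence of two strings."""
    # prev[j] is the LCS length of the part of url1 processed so far and url2[:j].
    prev = [0] * (len(url2) + 1)
    for ch1 in url1:
        curr = [0]
        for j, ch2 in enumerate(url2, start=1):
            if ch1 == ch2:
                # Matching characters extend the LCS of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better result from dropping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of url2.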

I have a series s1 and a series s2, each with a bunch of URLs (13,000 each). I want to compare each element of one series with each element of the other (169,000,000 comparisons).

I did it with two nested for-loops, but it's way too slow.

for index1, url1 in s1.items():
    for index2, url2 in s2.items():
        if index1 != index2:
            lcs1 = lcs_dynamic(url1, url2)  # usage of my custom function
            overlap = lcs1 / len(url2)
            print(index1, index2, url1, url2, overlap)

Is there a better way to do it?

I thought about the apply() method, but I couldn't figure out how to get access to series2 and the second URL, as my custom function lcs_dynamic needs both URLs as arguments.

series1.apply(lcs_dynamic(url1, url2)) --> in this case I would get url1 from series1, but how can I get access to series2 and url2? I don't know.
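One way to reach the second series from inside apply() is to nest a second apply() in a lambda, so the inner call closes over series2. The sketch below assumes nothing about lcs_dynamic beyond it taking two arguments; compare_all and shared_length are hypothetical names, and shared_length is only a placeholder for the real function:

```python
import pandas as pd

def compare_all(s1, s2, func):
    """Return a DataFrame whose cell (i, j) holds func(s1[i], s2[j])."""
    # The outer apply walks s1; the inner lambda closes over s2,
    # so func receives one value from each series.
    return s1.apply(lambda url1: s2.apply(lambda url2: func(url1, url2)))

s1 = pd.Series(['a/b', 'c/d/e'])
s2 = pd.Series(['a', 'c/d'])

# Placeholder two-argument function standing in for lcs_dynamic.
def shared_length(url1, url2):
    return min(len(url1), len(url2))

# Rows are labelled by the index of s1, columns by the index of s2.
table = compare_all(s1, s2, shared_length)
```

This still calls the Python function once per pair, so it is tidier than the nested loops rather than fundamentally faster.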

Thanks in advance!

TGee

2 Answers

To summarize the comments above:

First, define the two dataframes containing the series:

import pandas as pd

df1 = pd.DataFrame({'url1': ['url1/path1/subpath1/subpath2', 'url2/path2/subpath1/subpath2']})
df2 = pd.DataFrame({'url2': ['url1/path1/subpath1', 'url2/path2/subpath1']})

Next, do a cross join to generate all the possible combinations:

df = df1.merge(df2, how='cross')

Next, apply the custom function:

df['lcs'] = df.apply(lambda row: lcs_dynamic(row['url1'], row['url2']), axis=1)
df['overlap'] = df['lcs'] / df['url2'].str.len()
heretolearn
  • That worked perfectly! But I was missing my indexes from both series/dataframes. I have now turned the indexes of both dataframes into columns, and it worked out! I now have a dataframe with: index_df1 | url1 | index_df2 | url2 | lcs | overlap. Thanks a lot! – TGee Jun 02 '21 at 11:31

It worked fine so far. But how do I get rid of the duplicates? As I do a cross join, I will get, for example:

(1;1), (1;2), (1;3)
(2;1), (2;2), (2;3)

I want to remove the duplicates (1;1), (2;2), etc., but also one of each mirrored pair such as (1;2) and (2;1), because for me they are the same.
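One possible approach, assuming both dataframes share the same index labels and the indexes have been turned into the columns index_df1 and index_df2 as described in the comment above, is to keep only the rows where index_df1 is strictly smaller than index_df2. The sample URLs below are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'url1': ['a/x', 'b/y', 'c/z']})
df2 = pd.DataFrame({'url2': ['a/x', 'b/y', 'c/z']})

# Turn each index into a regular column so it survives the cross join.
left = df1.rename_axis('index_df1').reset_index()
right = df2.rename_axis('index_df2').reset_index()

pairs = left.merge(right, how='cross')

# A strict "<" drops the self-pairs (1;1), (2;2), ... and keeps
# exactly one of each mirrored pair such as (1;2) and (2;1).
unique_pairs = pairs[pairs['index_df1'] < pairs['index_df2']].reset_index(drop=True)
```

Filtering before applying lcs_dynamic also cuts the number of comparisons roughly in half. Note that the overlap is divided by the length of url2, so (1;2) and (2;1) only give the same value when both URLs have the same length.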

TGee