6

I have the following dataframe:

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in the name column if those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with one unique id, say 1. The next cluster is number 2, and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.

I wrote code for the logic above:

# pip install thefuzz
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}

df_test = pd.DataFrame(d_test)

df_test['id'] = 0

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_,'id'] = int(i)
                is_i_used = True
    if is_i_used == True:
        i += 1
        is_i_used = False

The code generates the expected result:

    name        cluster_number id
0   South Beach 1              1
1   Dog         2              2
2   Bird        3              3
3   Ant         3              4
4   Big Dog     2              2
5   Beach       1              1
6   Dear        4              5
7   Cat         2              6

Note that for Cat we got id 6 rather than 3, but that is fine because it is unique anyway.

While the algorithm above works for the test data, I am not able to use it for my real data (about 1 million rows), and I am trying to understand how to vectorize the code and get rid of the two for-loops.

The thefuzz module also has a process function that allows processing data at once:

from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))

But I don't see how it can help with speeding up the code.

illuminato
  • How many unique values does `df['name']` have? Run `df['name'].nunique()` on your dataset and tell me – Panda Kim Dec 11 '22 at 09:17
  • @PandaKim `df['name']` has `419776` unique values – illuminato Dec 11 '22 at 09:28
  • You have not disclosed `df['cluster_number'].nunique()`, which seems relevant. Guessing from your example data, I am going to imagine that it is "large", or put another way, that names per cluster is "small". – J_H Dec 11 '22 at 22:28
  • @J_H the largest cluster has about 200 rows; the mean is about 10. – illuminato Dec 12 '22 at 02:55

4 Answers

6

tl;dr: Avoid O(N^2) running time if N is big.

> help with speeding up the code.

People get down on .iterrows(), calling it "slow".

Switching from .iterrows to a vectorized approach might "speed things up" somewhat, but that's a relative measure. Let's talk about complexity.

time complexity

Your current algorithm is quadratic; it features a pair of nested .iterrows loops. But then immediately we filter on

        if same_cluster and not_yet_assigned:

Now, that could be workable for "small" N. But an N of 400K quickly becomes infeasible:

>>> 419_776 ** 2 / 1e9
176.211890176

One hundred seventy-six billion iterations (with a "B") is nothing to sneeze at, even if each filter step has trivial (yet non-zero) cost.

At the risk of reciting facts that have tediously been repeated many times before,

  • sorting costs O(N log N), and
  • N log N is very significantly less than N^2

I'm not convinced that what you want is to "go fast". Rather, I suspect what you really want is to "do less". Start by ordering your rows, and then make a roughly linear pass over that dataset.

You didn't specify your typical cluster group size G. But since there are many distinct cluster numbers, we definitely know that G << N. Restricting comparisons to rows within the same cluster brings the total work down from O(N^2) to roughly O(N × G): there are about N/G clusters, and each costs O(G^2).
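
To make the gap concrete, here is a rough back-of-the-envelope comparison using the sizes mentioned in the comments (1 million rows, clusters of roughly 10 names on average); the variable names are just for illustration:

n = 1_000_000   # rows in the real dataset
g = 10          # average cluster size, per the comments

print(n * n)    # ~1e12 pairwise comparisons when every name is compared to every other
print(n * g)    # ~1e7 comparisons when each name is only compared within its own cluster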


df = df_test.sort_values(['cluster_number', 'name'])

You wrote

for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():

Turn that into

for index, row in df.iterrows():
    while ...

and use .iloc[] to examine the relevant rows.

The while loop gets to terminate as soon as a new cluster number is seen, instead of having to slog through hundreds of thousands of rows to the end of the dataframe every time.

Why can it exit early? Due to the sort order.
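
As a minimal sketch of that shape (not the exact code), assuming the question's fuzz.ratio > 50 threshold; the next_id counter and id_col bookkeeping are names I introduce for illustration:

from thefuzz import fuzz

df = df_test.sort_values(['cluster_number', 'name']).reset_index(drop=True)
df['id'] = 0
id_col = df.columns.get_loc('id')

next_id = 1
for index, row in df.iterrows():
    if df.loc[index, 'id'] != 0:
        continue                      # this name was already grouped earlier
    j = index
    # scan forward, stopping as soon as the cluster number changes
    while j < len(df) and df.iloc[j]['cluster_number'] == row['cluster_number']:
        if df.iloc[j]['id'] == 0 and fuzz.ratio(row['name'], df.iloc[j]['name']) > 50:
            df.iloc[j, id_col] = next_id
        j += 1
    next_id += 1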


A more convenient way to structure this might be to write a clustering helper.

def get_clusters(df):
    cur_num = -1
    cluster = []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            # the cluster number just changed, so the previous cluster is complete
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:
        yield cluster  # don't forget the final cluster

Now your top level code can iterate through a bunch of clusters, performing a fuzzy match of cost O(G^2) on each cluster.

The invariant on each generated cluster is that all rows within cluster shall have identical cluster_number.

And, due to the sorting, we guarantee that a given cluster_number shall be generated at most once.
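
For concreteness, a sketch of such top-level code, assuming the get_clusters helper above and the question's fuzz.ratio > 50 threshold (the ids list and the start offset are just one way to do the bookkeeping):

from thefuzz import fuzz

df = df_test.sort_values(['cluster_number', 'name']).reset_index(drop=True)

ids = [0] * len(df)        # ids[k] will hold the group id assigned to df row k
next_id = 1
start = 0                  # positional offset of the current cluster within df
for cluster in get_clusters(df):
    for i, row in enumerate(cluster):
        if ids[start + i]:
            continue       # already grouped with an earlier name in this cluster
        for j, other in enumerate(cluster[i:], start=i):
            if ids[start + j] == 0 and fuzz.ratio(row['name'], other['name']) > 50:
                ids[start + j] = next_id
        next_id += 1
    start += len(cluster)

df['id'] = ids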


https://stackoverflow.com/help/self-answer

Please measure the current running time, implement these suggestions, measure again, and post code + timings.

J_H
  • thanks for the provided ideas. I posted my answer based on your recommendation. I still wonder if a vectorized option is possible for this. – illuminato Dec 12 '22 at 03:06
4

Attempt #1

Based on @J_H's suggestions I made some changes to the original code:

import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}

df_test = pd.DataFrame(d_test)

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index

    # walk forward only while we stay in the same cluster and rows are unassigned
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_,'id'] = i
            is_i_used = True
        index_ += 1

    if is_i_used == True:
        i += 1
        is_i_used = False

Now, instead of hours of computation, it runs in only 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows.

While this is a significant improvement, I am still looking for a vectorized option.

Attempt #2

I created a vectorized version:

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")

But it is not possible to use it for a dataframe with 1 million rows, because cdist returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float. So for 1 million names the result matrix would require about 3.6 terabytes of memory.
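
For reference, the back-of-the-envelope arithmetic behind that figure, assuming 4 bytes per score:

n = 1_000_000
bytes_per_score = 4                  # assuming a 4-byte float per similarity score
total_bytes = n * n * bytes_per_score
print(total_bytes / 1e12)            # ~4.0 TB, i.e. roughly 3.6 TiB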

illuminato
  • Cool! I think you're looking for one of these two apply's, either https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html . Though I really do still advise that you adopt the `get_clusters` generator, as that would let you do a (small-ish) number of `.apply` vector operations on sub-problems. – J_H Dec 12 '22 at 03:13
3

Following up on your own answer: you don't need to compute process.cdist on all the names; you are only interested in those within the same cluster.

To do so, you can iterate over groups:

import numpy as np
import pandas as pd
from rapidfuzz import process

threshold = 50
index_start = 0
groups = []

for grp_name, grp_df in df_test.groupby("cluster_number"):
    
    names = grp_df["name"]

    scores = pd.DataFrame(
        data = (process.cdist(names, names, workers=-1)),
        columns = names,
        index = names
    )

    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
        .groupby(level=0)
        .agg(frozenset)
        .drop_duplicates()
        .reset_index(drop=True)
        .assign(restaurant_id = lambda t: t.index + index_start)
        .explode("name")
    )
    

    index_start = grps_in_group["restaurant_id"].max()+1
    groups.append(grps_in_group)
    
df_test.merge(pd.concat(groups), on="name")

|    | name        |   cluster_number |   id |   restaurant_id |
|---:|:------------|-----------------:|-----:|----------------:|
|  0 | Beach       |                1 |    0 |               0 |
|  1 | South Beach |                1 |    0 |               0 |
|  2 | Big Dog     |                2 |    0 |               1 |
|  3 | Cat         |                2 |    0 |               2 |
|  4 | Dog         |                2 |    0 |               1 |
|  5 | Dry Fish    |                2 |    0 |               3 |
|  6 | Fish        |                2 |    0 |               3 |
|  7 | Ant         |                3 |    0 |               4 |
|  8 | Bird        |                3 |    0 |               5 |
|  9 | Dear        |                4 |    0 |               6 |

Yet I am not sure this is an improvement.

Now, by moving the loop functionality into a function, we can use .groupby(...).apply(...); however, we then lose track of the consecutive index. To address that, I am using a trick with the pandas categorical type:

def create_restaurant_id(
    dframe: pd.DataFrame,
    threshold: int = 50
    ) -> pd.DataFrame:
    
    names = dframe["name"]

    scores = pd.DataFrame(
        data = (process.cdist(names, names, workers=-1)),
        columns = names,
        index = names
    )

    x, y = np.where(scores > threshold)
    
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
        .groupby(level=0)
        .agg(frozenset)
        .drop_duplicates()
        .reset_index(drop=True)
        .assign(restaurant_id = lambda t: t.index)
        .explode("name")
    )
    
    return grps_in_group

(df_test
 .groupby("cluster_number")
 .apply(create_restaurant_id)
 .reset_index(level=0)
 .assign(restaurant_id = lambda t: (
     t["cluster_number"].astype(str) + t["restaurant_id"].astype(str)
     ).astype("category").cat.codes
 )
)

In terms of performance on my laptop, with such a small dataframe, the two approaches are almost identical.

FBruzzesi
  • Thanks for the solution. Is it possible to have more than one cluster per iteration? – illuminato Dec 19 '22 at 19:33
  • I am editing and adding to the answer to avoid the for loop. However you may consider other dataframe libraries that are multicore and work in parallel by default. – FBruzzesi Dec 20 '22 at 08:02
  • Thanks. While both of your answers avoid out-of-memory errors, the computation time for one million rows exceeds 30 minutes. That is much slower than the double for loop (which takes only 2 minutes to compute). That is why I asked about the possibility of adding a "batch size" for `"cluster_number"` to your first solution, so it would be possible to iterate over more than one cluster per iteration (I think 2000 clusters per iteration would speed up the computation and require only ~8 GB of memory). – illuminato Dec 21 '22 at 04:18
0

I think you are thinking very analytically. Try this: what I'm doing here is giving each name a non-repeating id number (details below).

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}

df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

# Does the word occur more than once? (int)
repeat = 0

for i in range(df_test.shape[0]):
    heywtu = df_test[df_test['name'].str.contains(*df_test['name'][i].split())].index.values

    # Row 0 is a special case, so we take its id as 1 directly.
    if i == 0:
        df_test.loc[i, 'id'] = i + 1
    else:
        # Does the word occur more than once?
        repeat += len(heywtu) == 2

        # Fill the id column with a specific id number
        df_test.loc[i, 'id'] = i - repeat

    # Edit the id of names that contain the same word (other than row 0)
    if (len(heywtu) == 2) & (i != 0):
        df_test.loc[i, 'id'] = heywtu[0]
        continue

# Special case: if there are only 2 values
if (len(df_test['name']) == 2):
    df_test.loc[1, 'id'] = 2

# For the original d_test values
print(df_test.head(10)) 
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5          Ant               3  4.0
>>> 6         Bird               3  5.0
>>> 7         Dear               4  6.0

# For the extended d_test values (including Fish and Dry Fish)
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5     Dry Fish               2  4.0
>>> 6         Fish               2  4.0
>>> 7          Ant               3  5.0
>>> 8         Bird               3  6.0
>>> 9         Dear               4  7.0

# If there are only 2 values
df_test.head()
>>>          name  cluster_number   id
>>> 0      Big Dog               1  1.0
>>> 1  South Beach               2  2.0

What is repeat? If another string contains the same word, it gets counted, like Dog and Big Dog, and that count is subtracted from the index number. I hope this is helpful for your problem.

The Lord
  • thank you for the answer but the output does not seem right. `id` for `Big Dog` and `Dog` should be the same – illuminato Dec 19 '22 at 03:03
  • Edited! Is it helpful? – The Lord Dec 19 '22 at 11:06
  • Thanks for the edit, but please consider the following input: `d_test = {'name' : ['South Beach', 'Big Dog'], 'cluster_number' : [2, 1]} `. Your solution gives the same id `1` for both names, but they should be different because they belong to different clusters. – illuminato Dec 19 '22 at 15:45
  • I thought this code was for big data, but I added a new case for 2 values – The Lord Dec 19 '22 at 16:39