How to Detect Similar Sentences from Different Dataframes?

Question

Let's say I have these 2 pandas dataframes:

df_jkt

business name	address
zap clinic kemang	south jakarta
natasha beauty clinic ciracas	east jakarta
erha apothecary tebet	south jakarta
dr viona spkk	west jakarta

df_tng

business name	address
zap clinic bsd	tangerang
natasha clinic maja	tangerang
erha clinic bsd	tangerang
erha ultimate cipaku	tangerang

I want to compare the business name values and print the rows whose names are similar across both dataframes, so the desired output would look like this:

df_output

business name	address
zap clinic	tangerang
zap clinic	south jakarta
erha	tangerang
erha	tangerang
erha	south jakarta
natasha clinic	tangerang
natasha beauty clinic	east jakarta

I've tried using the NLTK library with this code:

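import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize NLTK and download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk_stopwords = set(stopwords.words('english'))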

# Function to compute Jaccard index
def compute_jaccard_similarity(str1, str2):
    set1 = set(word_tokenize(str1.lower()))
    set2 = set(word_tokenize(str2.lower()))
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

# Function to compute TF-IDF cosine similarity
def compute_cosine_similarity(str1, str2):
    # stop_words is documented as 'english', a list, or None, so convert the set to a list
    tfidf_vectorizer = TfidfVectorizer(stop_words=list(nltk_stopwords))
    tfidf_matrix = tfidf_vectorizer.fit_transform([str1, str2])
    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    return cosine_sim[0][0]

# Collect matching results in a list (converted to a DataFrame below)
matching_results = []

# Iterate through both DataFrames
for _, jkt_row in df_jkt.iterrows():
    for _, tangerang_row in df_tng.iterrows():
        jaccard_similarity = compute_jaccard_similarity(jkt_row['business name'], tangerang_row['business name'])
        cosine_similarity_result = compute_cosine_similarity(jkt_row['business name'], tangerang_row['business name'])
        
        # You can adjust the threshold values based on your requirement
        if jaccard_similarity > 0.3 or cosine_similarity_result > 0.3:
            matching_results.append({
                'business name_jkt': jkt_row['business name'],
                'business name_tang': tangerang_row['business name'],
                'jaccard_similarity': jaccard_similarity,
                'cosine_similarity': cosine_similarity_result
            })

df_output = pd.DataFrame(matching_results)

df_output

But it raises this error: TypeError: 'numpy.float64' object is not callable

How can I fix the code? Or is there a simpler way to solve my problem?
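
One simpler route, sketched here under assumptions rather than offered as a definitive fix: drop words that are too generic to identify a business, then pair names from the two dataframes that still share at least one token. `GENERIC_WORDS`, `name_tokens`, and `shared_name_pairs` are hypothetical names introduced for illustration, and the name lists are copied from the example tables (with real dataframes they could come from `df_jkt['business name'].tolist()`).

```python
from itertools import product

# Hypothetical set of words too generic to identify a business; tune for real data.
GENERIC_WORDS = {"clinic"}


def name_tokens(name):
    """Lowercase a name, split on whitespace, and drop generic words."""
    return {tok for tok in name.lower().split() if tok not in GENERIC_WORDS}


def shared_name_pairs(names_a, names_b):
    """Return (name_a, name_b, shared_tokens) for every pair sharing a token."""
    pairs = []
    for a, b in product(names_a, names_b):
        shared = name_tokens(a) & name_tokens(b)
        if shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


# Business names copied from the example dataframes in the question.
jkt_names = ["zap clinic kemang", "natasha beauty clinic ciracas",
             "erha apothecary tebet", "dr viona spkk"]
tng_names = ["zap clinic bsd", "natasha clinic maja",
             "erha clinic bsd", "erha ultimate cipaku"]

matches = shared_name_pairs(jkt_names, tng_names)
for a, b, shared in matches:
    print(f"{a!r} ~ {b!r} via {shared}")
```

On the example data this pairs the zap, natasha, and erha rows and leaves "dr viona spkk" unmatched. Without the generic-word filter, every name containing "clinic" would match every other one, which is why a plain Jaccard threshold is hard to tune here.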

In this specific scenario, can't you just concatenate the dataframes and then sort them alphabetically by the `business name` column? Or is that not possible with the real dataset? - juanpethes, Aug 04 '23 at 09:11
@juanpethes that's not possible, since the real data is not as simple as the example in the question - rainy days., Aug 04 '23 at 09:13
this line `cosine_similarity_result = compute_cosine_similarity(jkt_row['business name'], tangerang_row['business name'])` - rainy days., Aug 04 '23 at 09:36
Where do you define the function `cosine_similarity`? Are you sure it is not a float? - Simi, Aug 04 '23 at 13:49
Where exactly? Please update your post with the full error trace - see how to make a [mre]. - desertnaut, Aug 06 '23 at 08:24
Not sure if that's causing the error, but shouldn't `stop_words` be either a `List` or `None`? You are passing a `set` instead. - Sudhir Bastakoti, Aug 06 '23 at 09:06
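
The float question raised in the comments points at a plausible cause, assuming the code ran in a notebook session where an earlier cell assigned a number to the name `cosine_similarity`. The sketch below reproduces that mechanism with a plain-Python stand-in for the scikit-learn function, so the message says `'float'` rather than `'numpy.float64'`; the stand-in and the `error_message` variable are illustrative only.

```python
def cosine_similarity(a, b):
    """Stand-in for the scikit-learn function of the same name (illustration only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


# An earlier cell stores the result under the function's own name,
# so the name now refers to a number instead of a function.
cosine_similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])

try:
    # Calling the name again now tries to call a float.
    cosine_similarity([1.0, 0.0], [0.0, 1.0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what happened, re-importing the function (or restarting the kernel) and keeping result variables under a different name, such as `cosine_similarity_result`, should avoid the clash.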
