
Let's say I have these two pandas DataFrames:

df_jkt

business name                  address
zap clinic kemang              south jakarta
natasha beauty clinic ciracas  east jakarta
erha apothecary tebet          south jakarta
dr viona spkk                  west jakarta

df_tng

business name         address
zap clinic bsd        tangerang
natasha clinic maja   tangerang
erha clinic bsd       tangerang
erha ultimate cipaku  tangerang
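For reproducibility, the two DataFrames above could be built roughly like this. This is only a sketch: it assumes the trailing city name ("south jakarta", "tangerang", and so on) is the address column and everything before it is the business name, which is a guess based on the addresses shown in the desired output.

```python
import pandas as pd

# Assumption: the trailing city is the address, and the words before it
# form the business name. Adjust the split if the real data differs.
df_jkt = pd.DataFrame({
    "business name": [
        "zap clinic kemang",
        "natasha beauty clinic ciracas",
        "erha apothecary tebet",
        "dr viona spkk",
    ],
    "address": ["south jakarta", "east jakarta", "south jakarta", "west jakarta"],
})

df_tng = pd.DataFrame({
    "business name": [
        "zap clinic bsd",
        "natasha clinic maja",
        "erha clinic bsd",
        "erha ultimate cipaku",
    ],
    # Every row in this frame has the same address.
    "address": ["tangerang"] * 4,
})

print(df_jkt)
print(df_tng)
```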

I want to compare the business name values and print the rows that have matching (similar) business names in both dataframes, so the desired output looks like this:

df_output

business name          address
zap clinic             tangerang
zap clinic             south jakarta
erha                   tangerang
erha                   tangerang
erha                   south jakarta
natasha clinic         tangerang
natasha beauty clinic  east jakarta

I've tried using the NLTK library (together with scikit-learn) with this code:

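import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize NLTK and download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk_stopwords = set(stopwords.words('english'))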

# Function to compute Jaccard index
def compute_jaccard_similarity(str1, str2):
    set1 = set(word_tokenize(str1.lower()))
    set2 = set(word_tokenize(str2.lower()))
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

# Function to compute TF-IDF cosine similarity
def compute_cosine_similarity(str1, str2):
    tfidf_vectorizer = TfidfVectorizer(stop_words = nltk_stopwords)
    tfidf_matrix = tfidf_vectorizer.fit_transform([str1, str2])
    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    return cosine_sim[0][0]

# Collect matching results in a list (converted to a DataFrame afterwards)
matching_results = []

# Iterate through both DataFrames
for _, jkt_row in df_jkt.iterrows():
    for _, tangerang_row in df_tng.iterrows():
        jaccard_similarity = compute_jaccard_similarity(jkt_row['business name'], tangerang_row['business name'])
        cosine_similarity_result = compute_cosine_similarity(jkt_row['business name'], tangerang_row['business name'])
        
        # You can adjust the threshold values based on your requirement
        if jaccard_similarity > 0.3 or cosine_similarity_result > 0.3:
            matching_results.append({
                'business name_jkt': jkt_row['business name'],
                'business name_tang': tangerang_row['business name'],
                'jaccard_similarity': jaccard_similarity,
                'cosine_similarity': cosine_similarity_result
            })

df_output = pd.DataFrame(matching_results)

df_output

But it raises this error: TypeError: 'numpy.float64' object is not callable
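For context, this message generally means that a name which used to refer to a function now refers to a NumPy float. The sketch below reproduces the same kind of error under one assumed scenario: the name `cosine_similarity` was rebound to a result earlier in the same session (for example in a previous notebook cell). Whether that is what happened here is not confirmed. The `cosine_similarity` function in the sketch is a hypothetical stand-in written with NumPy, not the scikit-learn function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Hypothetical stand-in for a cosine similarity function (illustration only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# First call works: the name still refers to the function.
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])

# Rebinding the function name to its own result, which is easy to do by accident.
cosine_similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])

# The name now holds a numpy.float64, so calling it fails.
try:
    cosine_similarity([1.0, 0.0], [1.0, 1.0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is the cause, restarting the session (or re-running the import) and keeping result variables under a different name than the function should avoid it.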

How can I fix the code? Or is there a simpler way to solve my problem?

