Let's say I have these two pandas DataFrames:

df_jkt

| business name | address |
|---|---|
| zap clinic kemang | south jakarta |
| natasha beauty clinic ciracas | east jakarta |
| erha apothecary tebet | south jakarta |
| dr viona spkk | west jakarta |

df_tng

| business name | address |
|---|---|
| zap clinic bsd | tangerang |
| natasha clinic maja | tangerang |
| erha clinic bsd | tangerang |
| erha ultimate cipaku | tangerang |

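
For anyone who wants to reproduce this, the two frames can be built roughly as follows. This is only a minimal sketch: the values are copied from the tables above, and the real data may contain more rows and columns.

```python
import pandas as pd

# Jakarta businesses, copied from the df_jkt table above
df_jkt = pd.DataFrame({
    "business name": [
        "zap clinic kemang",
        "natasha beauty clinic ciracas",
        "erha apothecary tebet",
        "dr viona spkk",
    ],
    "address": ["south jakarta", "east jakarta", "south jakarta", "west jakarta"],
})

# Tangerang businesses, copied from the df_tng table above
df_tng = pd.DataFrame({
    "business name": [
        "zap clinic bsd",
        "natasha clinic maja",
        "erha clinic bsd",
        "erha ultimate cipaku",
    ],
    "address": ["tangerang"] * 4,
})
```
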
I want to detect the business name values that the two DataFrames have in common and print the matching rows, so the desired output would look like this:

df_output

| business name | address |
|---|---|
| zap clinic | tangerang |
| zap clinic | south jakarta |
| erha | tangerang |
| erha | tangerang |
| erha | south jakarta |
| natasha clinic | tangerang |
| natasha beauty clinic | east jakarta |

I've tried using the NLTK library with this code:

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Initialize NLTK and download required resources
    nltk.download('punkt')
    nltk.download('stopwords')
    # Kept as a list because TfidfVectorizer expects stop_words to be a list
    nltk_stopwords = list(stopwords.words('english'))

    # Function to compute Jaccard index (shared tokens / all tokens)
    def compute_jaccard_similarity(str1, str2):
        set1 = set(word_tokenize(str1.lower()))
        set2 = set(word_tokenize(str2.lower()))
        intersection = len(set1.intersection(set2))
        union = len(set1.union(set2))
        return intersection / union

    # Function to compute TF-IDF cosine similarity
    def compute_cosine_similarity(str1, str2):
        tfidf_vectorizer = TfidfVectorizer(stop_words=nltk_stopwords)
        tfidf_matrix = tfidf_vectorizer.fit_transform([str1, str2])
        cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
        return cosine_sim[0][0]

    # Collect matching pairs here, then build a DataFrame from them
    matching_results = []

    # Compare every Jakarta row with every Tangerang row
    for _, jkt_row in df_jkt.iterrows():
        for _, tangerang_row in df_tng.iterrows():
            jaccard_similarity = compute_jaccard_similarity(jkt_row['business name'], tangerang_row['business name'])
            cosine_similarity_result = compute_cosine_similarity(jkt_row['business name'], tangerang_row['business name'])

            # You can adjust the threshold values based on your requirement
            if jaccard_similarity > 0.3 or cosine_similarity_result > 0.3:
                matching_results.append({
                    'business name_jkt': jkt_row['business name'],
                    'business name_tang': tangerang_row['business name'],
                    'jaccard_similarity': jaccard_similarity,
                    'cosine_similarity': cosine_similarity_result
                })

    df_output = pd.DataFrame(matching_results)
    df_output

But it raises this error:

    TypeError: 'numpy.float64' object is not callable
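
For reference, the same message can be reproduced in isolation when a name that used to point to a function gets rebound to a NumPy float, for example in an earlier notebook cell. Whether that is what happened in my session is only a guess. The sketch below uses a hypothetical toy function `similarity`, not the scikit-learn one:

```python
import numpy as np

def similarity(a, b):
    """Toy stand-in for a similarity function; returns a NumPy float."""
    shared = len(set(a) & set(b))
    total = len(set(a) | set(b))
    return np.float64(shared / total)

# Rebinding: after this line the name holds a float, not the function
similarity = similarity("abc", "abd")

try:
    similarity("abc", "abd")  # calling a numpy.float64 fails
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is the cause, restarting the kernel or re-importing the function before calling it would restore the original name.
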

How can I fix the code? Or is there a simpler way to solve my problem?