I'm trying to check whether two sentences have any words in common.

Here's an example:

string_one = "Author: James Oliver"
string_two = "James Oliver has written this beautiful article which says...."

In this case, these two sentences match the criteria as they contain some common words.

I've tried a bunch of solutions and none seems to work properly. The two sentences would each contain a fairly large number of words, so I think splitting them into lists and finding the intersection would be really inefficient.

saran3h
  • 2
    Is it inefficient? You should never assume things are inefficient till you have tried them and found that it is indeed a significant performance problem. – mousetail Feb 21 '23 at 07:17

2 Answers

We could convert the two strings into sets, and then check the intersection:

import re

s1 = "Author: James Oliver"
s2 = "James Oliver has written this beautiful article which says...."
# Extract runs of word characters, so punctuation such as ":" is dropped
w1 = re.findall(r'\w+', s1)
w2 = re.findall(r'\w+', s2)
intersection = set(w1) & set(w2)

if intersection:
    # Sets are unordered, so sort for a stable output
    print("Found common words: " + ' '.join(sorted(intersection)))
else:
    print("No words in common")

This prints:

Found common words: James Oliver

Your suggested approach is fine. Any solution has to find the words in each sentence, and that will take some time.
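If you are unsure whether this is fast enough, you can measure it rather than guess. Below is a rough timing sketch, not a benchmark: the helper name common_words and the filler text used to build long inputs are made up for illustration, and the numbers will depend on your machine and data.

```python
import re
import timeit

def common_words(a, b):
    """Return the set of words that appear in both strings."""
    return set(re.findall(r'\w+', a)) & set(re.findall(r'\w+', b))

# Hypothetical large inputs: made-up filler repeated to give 150,000+ words each
long_one = "Author: James Oliver " + "lorem ipsum dolor " * 50_000
long_two = "James Oliver has written this " + "sit amet consectetur " * 50_000

# Average wall-clock time over a few runs
runs = 5
seconds = timeit.timeit(lambda: common_words(long_one, long_two), number=runs) / runs
print(f"average per call: {seconds:.4f} s")
```

If the measured time is acceptable for your real inputs, there is no need for anything more elaborate.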

Tim Biegeleisen

If you only want to check whether they share any word, and you don't care which ones or how many, you can use any together with re.finditer. Because finditer yields matches lazily and any stops at the first true value, the program stops as soon as it finds a word in common, without 1) extracting the rest of the words (and storing them in a list) or 2) iterating over the unused words.

import re

s1 = "Author: James Oliver"
s2 = "James Oliver has written this beautiful article which says...."
words1 = set(re.findall(r'\w+', s1))
words2 = re.finditer(r'\w+', s2) # finditer evaluates lazily

if any(word2.group() in words1 for word2 in words2):
    print("Found words in common")
else:
    print("No words in common")
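
The same early-exit idea can be wrapped in a reusable function. This is only a sketch: the function name have_common_word is made up here, and it swaps any for the built-in set.isdisjoint, which accepts any iterable and can also stop at the first shared element.

```python
import re

def have_common_word(a, b):
    """Return True if the two strings share at least one word."""
    words_a = set(re.findall(r'\w+', a))
    # Generator over the words of b; nothing is extracted until it is consumed
    words_b = (match.group() for match in re.finditer(r'\w+', b))
    # isdisjoint returns False as soon as one word of b is found in words_a
    return not words_a.isdisjoint(words_b)

s1 = "Author: James Oliver"
s2 = "James Oliver has written this beautiful article which says...."
print(have_common_word(s1, s2))
```

As in the code above, the comparison is case-sensitive, so "Author" and "author" would not count as a match unless both strings are lowercased first.
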
Samathingamajig