
I am trying to compare two sentences and see if they contain the same set of words.
E.g. comparing "today is a good day" and "is today a good day" should return true.
I am currently using the Counter class from the collections module:

from collections import Counter


vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    if flag == 0:
        vocab[line] = 1

It seems to work fine for a few lines, but my text file has more than 1000 lines and it never finishes executing. Is there a more efficient way to compute the result for the entire file?

EDIT:

I just need a replacement for the Counter call, not a change to the rest of the implementation.

Eric Duminil
TheLastCoder

4 Answers


Try something like

set(sentence.split(" ")) == set(line.split(" "))

Comparing set objects is faster than comparing Counter objects. Both are hash-based collections, but when you compare Counter objects, both the keys and the values have to be compared, whereas sets only compare keys. Note that this also means repeated words are ignored by the set comparison.
Thank you Eric and Barmar for your inputs.
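A small illustration of the difference (assuming repeated words can safely be ignored in your data):

```python
from collections import Counter

a = "a a b".split()
b = "a b".split()

# As sets, repeated words are ignored:
print(set(a) == set(b))          # True
# As Counters, the multiplicities must match too:
print(Counter(a) == Counter(b))  # False
```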

Your full code will look like

vocab = {a dictionary of around 1000 sentences as keys}
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence] += 1

DataCruncher
  • There's really not much difference between a `set`, a `dict` and a `counter`. A set is basically a dict in which values are ignored. It's much better to use a `O(1)` or `O(n)` solution with counters than `O(n**2)` with sets. – Eric Duminil Jun 25 '19 at 20:49
  • I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngrams have the same set of words. Converting them to sets actually worked but it's still slow. I was wondering if there was a faster option – TheLastCoder Jun 25 '19 at 20:51
  • @EricDuminil Is there anything that I can use instead of sets? – TheLastCoder Jun 25 '19 at 20:57
  • @EricDuminil The counter solution has to compare both the keys and values, the set only has to compare keys. They're both O(n). – Barmar Jun 25 '19 at 21:01
  • @TheLastCoder: sets and counters are perfectly fine. You just need to find the correct keys and use sets the way they're supposed to be used : not iterating over every key. – Eric Duminil Jun 25 '19 at 21:02
  • @Barmar: The problem here is the structure of the code and that there's one unneeded loop. It doesn't make any noticeable difference at all if both keys and values are compared, or just the keys. OP's code seems to be at least O(n**2), but could be O(n). – Eric Duminil Jun 25 '19 at 21:16
  • @Eric I told you, this is not the actual code I am running, I am generating the vocab dictionary within the loop, so I have to put it inside. I am really sorry for the confusion. Wait, let me edit the question. By far set has been the fastest implementation – TheLastCoder Jun 25 '19 at 21:25
  • @Eric Please check the code again. Hope it makes sense. Do ask me if you need any clarification. – TheLastCoder Jun 25 '19 at 21:30
  • Edited the solution. Thank you for all your inputs. – DataCruncher Jun 25 '19 at 21:47
  • Sorry for being blunt, but `Counter` wasn't the problem, and your code isn't the solution either. There simply shouldn't be two `for` loops. – Eric Duminil Jun 26 '19 at 07:29

You really don't need to use two loops.

Correct way to use dicts

Let's say you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is to be able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will be, on average, 500 times slower than the second. Here's a simple analogy I found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.

Refactored code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple, string, frozenset). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

or even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried until now.
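As an aside, if repeated words don't matter for your use case, a frozenset of the words is another immutable signature. It is hashable, so it can be used as a Counter or dict key directly (a quick sketch, not part of the original answer; note it treats 'a a b c' and 'a b c' as the same sentence, unlike the sorted-join key):

```python
from collections import Counter

sentences = ["Today is a good day", "Is today a good day", "a a b c", "a b c"]

# frozenset is hashable and ignores both word order and repetitions.
vocab = Counter(frozenset(s.lower().split()) for s in sentences)

print(vocab[frozenset("today is a good day".split())])  # 2
```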

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault:

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
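Equivalently, collections.defaultdict saves the explicit setdefault call, a purely stylistic variant if you prefer it:

```python
from collections import defaultdict

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = defaultdict(list)
for sentence in sentences:
    # Missing keys are created with an empty list automatically.
    vocab[sorted_words(sentence)].append(sentence)

print(dict(vocab))
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
#    'a b c': ['a b c', 'c b a'],
#    'a a b c': ['a a b c']}
```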
Eric Duminil
  • This actually works really fast. But could you elaborate on how I could make the above code faster. Just by changing the counter and use something else. Either user defined or in built function – TheLastCoder Jun 25 '19 at 23:34
  • I lose the order of words when I create a dictionary with the strings as keys. Yes I am able to get the count of similar sentences but then I lose the original order – TheLastCoder Jun 26 '19 at 12:18
  • @TheLastCoder: That's why I wrote the "more complex example". Anyway, there's a shorter version in "Yet another alternative". – Eric Duminil Jun 26 '19 at 12:28
  • I understand how the dictionary works. What I want is to have dictionary keys that are already in the text with the count equal to the number of similar strings(similar means have the same set of words) – TheLastCoder Jun 26 '19 at 15:10
  • @TheLastCoder: What would a key look like, for example for `"Today is a good day"`? – Eric Duminil Jun 26 '19 at 15:17
  • yes! and it's values will be 2 since ['Today is a good day', 'Is today a good day'], – TheLastCoder Jun 26 '19 at 15:29
  • @TheLastCoder: Yes what? What would be the common key for `['Today is a good day', 'Is today a good day']`? – Eric Duminil Jun 26 '19 at 15:34
  • Anything, maybe the one that comes first – TheLastCoder Jun 26 '19 at 18:56
  • @TheLastCoder: All the information you need is inside `vocab = {'a day good is today': ['Today is a good day', 'Is today a good day'], 'a b c': ['a b c', 'c b a'], 'a a b c': ['a a b c']}`. If you have a sentence, you just need to call `vocab[sorted_words(sentence)]` to get all the sentences with the same words, or `vocab[sorted_words(sentence)][0]` to get the first one. – Eric Duminil Jun 27 '19 at 09:26

To take duplicate/multiple words into account, your equality comparison could be:

def hash_sentence(s):
    return hash(''.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'

hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False

Also, note that in your implementation every sentence is counted n times (`for sentence in vocab:`).
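One caveat worth flagging: `''.join` without a separator erases word boundaries, so distinct sentences can produce the same key ('ab c' vs 'a bc'). A tuple of the sorted words is hashable too and keeps the boundaries, a quick sketch:

```python
# ''.join(sorted(...)) loses the word boundaries:
print(''.join(sorted('ab c'.split())))   # abc
print(''.join(sorted('a bc'.split())))   # abc  -- same key, different sentences

# A tuple of sorted words is also hashable and avoids the collision:
print(tuple(sorted('ab c'.split())))     # ('ab', 'c')
print(tuple(sorted('a bc'.split())))     # ('a', 'bc')
```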

Markus Rother
  • I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster option – TheLastCoder Jun 25 '19 at 20:51
  • It's probably the way to go. You can then group the sentences by hash and get similar sentences directly. – Eric Duminil Jun 25 '19 at 21:11
  • This is an elegant solution. Let me use the timeit function and get back to you! let me see if the set function or this implementation is faster – TheLastCoder Jun 25 '19 at 21:14
  • Time for this implementation : 31.8790001869, now running the plain old set implementation – TheLastCoder Jun 25 '19 at 21:19
  • time taken using sets : 15.1610000134 – TheLastCoder Jun 25 '19 at 21:21
  • 1
    that is interesting... try to tuple the sorted sequence instead of joining to a string if you would... – Markus Rother Jun 25 '19 at 21:23
  • 1
    actually, depending on your input data, you may be able to even omit the `.split()` and just sort the string directly -- for the purpose of hashing that is. – Markus Rother Jun 25 '19 at 21:27
  • Its just plain text file with over 1000 lines of random text. Yeah, I can try that – TheLastCoder Jun 25 '19 at 21:29
  • Well it certainly performed faster than using join. time taken 26.7250001431 still slower than sets :/ – TheLastCoder Jun 25 '19 at 21:33
  • Out of despair a last one: `sum(hash(x) for x in s.split())`, else: built-ins for the win!!! – Markus Rother Jun 25 '19 at 21:37
  • @TheLastCoder: Could you please check https://stackoverflow.com/a/56761941/6419007 and see if it works for you ? Load all your file with `sentences = file_ob.readlines()` and run the code. I think it should complete in a few ms. – Eric Duminil Jun 25 '19 at 21:51
  • @MarkusRother: Don't despair yet. Your code and idea are fine. I really think it's the way OP uses them that's the problem here. – Eric Duminil Jun 25 '19 at 21:52
  • @MarkusRother: time taken 41.7990000248 :/ – TheLastCoder Jun 25 '19 at 21:58
  • @TheLastCoder wait, let me try it. Sorry for not responding. – TheLastCoder Jun 25 '19 at 21:58
  • I am just going with sets. @MarkusRother, I have saved your code for referencing in the future. – TheLastCoder Jun 25 '19 at 23:38

In your code, you can hoist the Counter construction out of the inner loop instead of recalculating it for every pair - this should improve the runtime by a factor proportional to the average number of tokens per string.

from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}

vocab_counter = {k: Counter(k.split(" ")) for k in vocab.keys()}

for line in file_obj:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1

Further improvements could be had by using the Counters as indices to a dictionary, which would let you replace the linear search for matching sentences with a lookup. The frozendict package would probably be useful so that you can use a dictionary as a key to another dictionary.
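If you'd rather avoid the extra dependency, one way to get a hashable, order-independent key out of a Counter is to freeze its items (a sketch along the lines of the suggestion above; frozendict would serve the same role):

```python
from collections import Counter

def counter_key(line):
    # A frozenset of (word, count) pairs is hashable and independent
    # of word order, so it can be used directly as a dict key.
    return frozenset(Counter(line.split(" ")).items())

vocab = {}
for line in ["today is a good day", "is today a good day", "a a b"]:
    key = counter_key(line)
    vocab[key] = vocab.get(key, 0) + 1

print(sorted(vocab.values()))  # [1, 2]
```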

Neal Fultz
  • I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngrams have the same set of words. Converting them to sets actually worked but it's still slow. I was wondering if there was a faster option – TheLastCoder Jun 25 '19 at 20:46