I have two lists, message and keyword. They look like this:

message = ["my name is blabla",'x-men is a good movie','i deny that fact']
keyword = ['x-men','name is','psycho movie']

I want to make a new list which contains keywords that are present in the message.

import re

newList = []
for message_index in message:
    print(newList)
    for kw in keyword:
        if re.search(r'\b{}\b'.format(kw), message_index):
            newList.append(kw)

My Python code is above. The problem is that each sentence in my message list is around 100 to 150 words long and the list has 3,000 entries. Each keyword may be one or two words, and the keyword list has 12,000 entries.

So the search is taking a long time. Is there a shorter way to do it?

This question is different because of the large amount of data in both lists.

  • Do you want every occurring keyword only once? – user2390182 Nov 01 '17 at 13:08
  • You seem to be using `re.search`, not substring search. Which one do you really need? – AKX Nov 01 '17 at 13:31
  • Possible duplicate of [Python: Find a substring in a string and returning the index of the substring](https://stackoverflow.com/questions/21842885/python-find-a-substring-in-a-string-and-returning-the-index-of-the-substring) – Huseyin Yagli Nov 01 '17 at 13:41

3 Answers

With built-in any() function:

To search by simple occurrence:

message = ["my name is blabla",'x-men is a good movie','i deny that fact']
keyword = ['x-men','name is','psycho movie']

result = [k for k in keyword if any(k in m for m in message)]
print(result)

The output:

['x-men', 'name is']

----------

If you need to search by exact words:

import re

message = ["my name is blabla",'x-men is a good movie','i deny that fact']
keyword = ['x-men','name is','psycho movie']

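# re.escape keeps keywords containing regex metacharacters from being misread as patterns
result = [k for k in keyword if any(re.search(r'\b{}\b'.format(re.escape(k)), m) for m in message)]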
RomanPerekhrest
  • And `any` will efficiently break on the first `True` – dawg Nov 01 '17 at 13:27
  • I think this still doesn't solve the general performance issue. The complexity is still `O(M*N)` unless most keywords occur in most messages, which is what it would take for the early break to be felt. – user2390182 Nov 01 '17 at 13:49
  • @schwobaseggl, I suppose neither a `set` object nor binary search would be applicable in the case of a search with word boundaries. Feel free to present a more efficient approach – RomanPerekhrest Nov 01 '17 at 13:55
  • Nah, there are other algorithms, however, for finding each of a list of substrings in NLP. You can, e.g., transform the keyword list into a tree structure and maintain positions in the tree while traversing the text. But this seems beyond the scope of this thread :) see for instance https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm – user2390182 Nov 01 '17 at 13:58
  • @schwobaseggl, I can repeat my last sentence: *Feel free to present a more efficient approach* . P.S. The OP also wanted *is there a **shorter** way to do it?* – RomanPerekhrest Nov 01 '17 at 14:02
  • @RomanPerekhrest Agreed! I mean no offence. This is definitely a significant improvement over the OP's code. – user2390182 Nov 01 '17 at 14:23
  • No, no sarcasm. I added that comment (and a +1) as additional explanation why `any` is a good start. – dawg Nov 01 '17 at 15:03
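
The tree-based idea mentioned in the comments above could look roughly like the sketch below. It is a plain word-level trie, not full Aho-Corasick (there are no failure links), and it assumes that splitting on whitespace is an acceptable stand-in for the `\b` word boundaries, so punctuation attached to a word would prevent a match. The function names `build_trie` and `find_keywords` are made up for this illustration.

```python
def build_trie(keywords):
    """Build a word-level trie; the key None marks the end of a keyword."""
    root = {}
    for kw in keywords:
        node = root
        for word in kw.split():
            node = node.setdefault(word, {})
        node[None] = kw
    return root


def find_keywords(messages, keywords):
    """Return the keywords found in any message, in keyword-list order."""
    root = build_trie(keywords)
    found = set()
    for msg in messages:
        words = msg.split()  # assumption: whitespace tokenisation is good enough
        for start in range(len(words)):
            node = root
            # Walk the trie for as long as the following words keep matching.
            for i in range(start, len(words)):
                node = node.get(words[i])
                if node is None:
                    break
                if None in node:
                    found.add(node[None])
    return [kw for kw in keywords if kw in found]


message = ["my name is blabla", 'x-men is a good movie', 'i deny that fact']
keyword = ['x-men', 'name is', 'psycho movie']
print(find_keywords(message, keyword))
```

Each message is walked once per starting word, and the inner walk stops as soon as no keyword continues with the next word, so the work per message no longer grows with the number of keywords.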

You can significantly cut the number of searches by joining the message list into a single delimited string and then searching for each keyword in that string:

>>> ms='\t'.join(message)
>>> [e for e in keyword if e in ms]
['x-men', 'name is']

The same method would work with a regex with the same benefit:

>>> import re
>>> [e for e in keyword if re.search(r'\b'+re.escape(e)+r'\b', ms)]
['x-men', 'name is']

This reduces the number of search calls from M*N to N, one per keyword, although each call now scans the whole joined string...
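
For reference, here is the same idea as a self-contained script rather than an interpreter session. The function name `keywords_in_messages` and the `whole_words` flag are made up for this sketch, and it assumes no keyword contains a tab, since the tab is what keeps a keyword from matching across two adjacent messages.

```python
import re


def keywords_in_messages(keywords, messages, whole_words=True):
    """Return the keywords that occur in at least one message."""
    # Join once; assumes no keyword itself contains the tab delimiter.
    joined = '\t'.join(messages)
    if not whole_words:
        # Plain substring test, one scan of the joined text per keyword.
        return [k for k in keywords if k in joined]
    # re.escape guards against keywords containing regex metacharacters.
    return [k for k in keywords
            if re.search(r'\b' + re.escape(k) + r'\b', joined)]


message = ["my name is blabla", 'x-men is a good movie', 'i deny that fact']
keyword = ['x-men', 'name is', 'psycho movie']
print(keywords_in_messages(keyword, message))
```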

dawg

Try using a nested list comprehension:

result = [key for key in keyword for msg in message if key in msg]
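
Note that a nested comprehension like this adds a keyword once for every message that contains it, so the same keyword can appear several times. A small sketch of the effect, using a made-up message list in which two messages contain the same keyword, with `dict.fromkeys` as one possible way to drop the repeats while keeping the order:

```python
# Hypothetical data: two of the messages contain 'name is'.
message = ["my name is blabla", "her name is unknown", "x-men is a good movie"]
keyword = ['x-men', 'name is', 'psycho movie']

# One entry per (keyword, message) pair that matches.
matches = [key for key in keyword for msg in message if key in msg]
print(matches)

# dict.fromkeys keeps the first occurrence of each keyword, in order.
unique = list(dict.fromkeys(matches))
print(unique)
```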
APorter1031