How to create a binary list based on inclusion of list elements in another list

Question

Given two lists of words, dictionary and sentence, I'm trying to create a binary representation based on the inclusion of words of dictionary in the sentence such as [1,0,0,0,0,0,1,...,0] where 1 indicates that the ith word in the dictionary shows up in the sentence.

What's the fastest way I can do this?

Example data:

dictionary =  ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
sentence = ['I', 'like', 'to', 'eat', 'apples']
result = [0,0,1,1,1,0,0,1,0,0]

Is there something faster than the following considering that I'm working with very large lists of approximately 56'000 elements in size?

x = [int(i in sentence) for i in dictionary]

Yeah, I have over 100,000 sentences, each now represented as a list of the words that comprise it. I now need to represent each of these sentences as a boolean array where a value of 1 at the ith index indicates that the ith word in the dictionary, which I have previously created, exists in the sentence. – user2353644, May 06 '13 at 07:36
Using sets you can do this in `O(N)` time complexity, so 10**5 items are not an issue. BTW don't use the word `sentences` here, it's confusing: a `sentence` is a set of space-separated words, while your sentences list contains simple words. – Ashwini Chaudhary, May 06 '13 at 07:39
Note that the comment of @user2353644 above contradicts what they initially wrote in the original post, which led to some confusion in the answers regarding what should be converted to a set and what should be iterated over. – Georgy, Jul 22 '20 at 14:34

score 1 · Answer 1 · answered May 06 '13 at 07:21

set2 = set(list2)
x = [int(i in set2) for i in list1]

John La Rooy

You can't search `"hello"` in `"hello world"` using this. – Ashwini Chaudhary May 06 '13 at 07:22
@AshwiniChaudhary, My interpretation was that list2 was a list of words from the sentence. – John La Rooy May 06 '13 at 07:24
Your interpretation was quite right; his lists contain simple words, not sentences. – Ashwini Chaudhary May 06 '13 at 07:46

Ashwini Chaudhary · Answer 2 · 2013-05-06T07:32:19.023

Use sets; the total time complexity is then O(N):

>>> sentence = ['I', 'like', 'to', 'eat', 'apples']
>>> dictionary =  ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
>>> s= set(sentence)
>>> [int(word in s) for word in dictionary]
[0, 0, 1, 1, 1, 0, 0, 1, 0, 0]

In case your sentence list contains actual sentences rather than words, try this:

>>> sentences= ["foobar foo", "spam eggs" ,"monty python"]
>>> words=["foo", "oof", "bar", "pyth" ,"spam"]
>>> from itertools import chain

# fetch words from each sentence and create a flattened set of all words
>>> s = set(chain(*(x.split() for x in sentences)))

>>> [int(x in s) for x in words]
[1, 0, 0, 0, 1]
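
Note that splitting gives whole-word matching, which is why "bar" and "pyth" score 0 above even though they occur inside "foobar" and "python". A minimal sketch contrasting the two behaviours, reusing the same sample data; the substring variant is only an illustration and scans every sentence for every word, so it is slower:

```python
sentences = ["foobar foo", "spam eggs", "monty python"]
words = ["foo", "oof", "bar", "pyth", "spam"]

# Whole-word matching: split each sentence and collect the tokens in a set.
tokens = {token for sentence in sentences for token in sentence.split()}
whole_word = [int(w in tokens) for w in words]

# Substring matching: test each word against every sentence string.
substring = [int(any(w in sentence for sentence in sentences)) for w in words]

print(whole_word)  # [1, 0, 0, 0, 1]
print(substring)   # [1, 0, 1, 1, 1]
```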

score 0 · Accepted Answer · answered May 06 '13 at 07:12

I would suggest something like this:

>>> words = set(['hello', 'there'])  # have the words available as a set
>>> sentance = ['hello', 'monkey', 'theres', 'there']
>>> rep = [1 if w in words else 0 for w in sentance]
>>> rep
[1, 0, 0, 1]

I would take this approach because sets have O(1) average lookup time; that is, checking whether w is in words takes constant time. The list comprehension is therefore O(n), as it must visit each word once. I believe this is as efficient, or close to as efficient, as you will get.
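
To see the difference for yourself, a rough timing sketch along these lines may help. The data is made up and scaled down from the sizes in the question, the helper names `with_list` and `with_set` are hypothetical, and the actual timings will vary by machine:

```python
import timeit

# Made-up data, much smaller than the ~56,000-element lists in the question.
dictionary = [f"word{i}" for i in range(2000)]
sentence = [f"word{i}" for i in range(0, 2000, 2)]
sentence_set = set(sentence)  # built once, reused for every lookup

def with_list():
    # Each `in` scans the list, so this is O(len(dictionary) * len(sentence)).
    return [int(w in sentence) for w in dictionary]

def with_set():
    # Each `in` is an average O(1) hash lookup, so this is O(len(dictionary)).
    return [int(w in sentence_set) for w in dictionary]

# Both approaches must give the same binary list.
assert with_list() == with_set()

print("list:", timeit.timeit(with_list, number=5))
print("set: ", timeit.timeit(with_set, number=5))
```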

You also mentioned creating a 'Boolean' array. If actual booleans are acceptable, you can simply write the following instead:

>>> rep = [w in words for w in sentance]
>>> rep
[True, False, False, True]

HennyH

OP is iterating over `words` and searching in `sentences` list, plus the items in your `sentences` list are words not `sentences`. – Ashwini Chaudhary May 06 '13 at 07:14
No, the OP is iterating over the sentence, and then creating a representation of the sentence based on each word in the sentence either being in a dictionary of words or not. "**where a 1 indicates that the ith word in the dictionary shows up in the sentence.**" – HennyH May 06 '13 at 07:16
Thanks for your help, I will definitely implement sets. Unfortunately, I will still have to apply this list comp over 100,000 times, as I have 100,000 sentences. I'm trying to see if there is a clever numpy way I can apply something similar to this list comp to a matrix of sentences where each row is a sentence. – user2353644 May 06 '13 at 07:18
I don't believe you can escape from having to visit every word in each sentence. – HennyH May 06 '13 at 07:21
@HennyH see OP's sample data, he's iterating over words(dictionary) not sentences. – Ashwini Chaudhary May 06 '13 at 07:48
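
For many sentences, one option is to build a word-to-index map once and fill each row from the sentence's own words, so the lookup work scales with the sentence length rather than the dictionary length. A minimal pure-Python sketch, assuming each sentence is already a list of words and the dictionary entries are unique; `encode_sentences` is a hypothetical helper name, not something from the thread:

```python
def encode_sentences(dictionary, sentences):
    """Return one 0/1 row per sentence, aligned with `dictionary`."""
    # Build the word -> position map once (assumes dictionary words are unique).
    index = {word: i for i, word in enumerate(dictionary)}
    rows = []
    for sentence in sentences:
        row = [0] * len(dictionary)
        for word in sentence:
            i = index.get(word)  # None when the word is not in the dictionary
            if i is not None:
                row[i] = 1
        rows.append(row)
    return rows

dictionary = ['aardvark', 'apple', 'eat', 'I', 'like', 'maize', 'man', 'to', 'zebra', 'zed']
sentences = [['I', 'like', 'to', 'eat', 'apples'], ['man', 'eat', 'maize']]
rows = encode_sentences(dictionary, sentences)
print(rows[0])  # [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
print(rows[1])  # [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
```

Allocating each row is still proportional to the dictionary size; only the membership work shrinks. With 100,000 sentences and a large dictionary, a sparse representation (storing just the matching indices) may be worth considering.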
