I have a list of about a million sentences (N) and a list of string lists (M). I want an M x N matrix in which each element is the number of times the strings from one string list occur in one sentence, where matches are allowed to overlap. For example:

sentence_list = ['Homegrown tech giant', 'GoTo gained 23 percent', 'at its Indonesia Stock Exchange']
list_of_string_list = [['homeg', 'goto'], ['to ga', 'gained', 'cents']]

and I want a result array of dimension 2x3, like this:

[[1, 1, 0]  #match homeg, match goto, no match
 [0, 2, 0]] #no match, match to ga and gained, no match

How can I do this quickly in Python?

1 Answer

It's just brute force:

sentence_list = ['Homegrown tech giant', 'GoTo gained 23 percent', 'at its Indonesia Stock Exchange']
list_of_string_list = [['homeg', 'goto'], ['to ga', 'gained', 'cents']]

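# Lowercase the sentences once so matching is case-insensitive
# (the search strings are assumed to be lowercase already).
sentence_list = [x.lower() for x in sentence_list]

array = []
for sublist in list_of_string_list:
    # One row per string list, one column per sentence.
    row = []
    for sentence in sentence_list:
        count = 0
        for word in sublist:
            # str.count returns the number of non-overlapping occurrences.
            count += sentence.count(word)
        row.append(count)
    array.append(row)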

print(array)
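
One caveat, in case "overlapped" in the question also covers a string overlapping with itself: str.count only counts non-overlapping occurrences, so 'aaaa'.count('aa') is 2, not 3. Matches of different strings that overlap each other (like 'to ga' and 'gained' above) are already counted by the loop. Below is a minimal sketch of a self-overlap-aware variant. The helper names count_overlapping and match_matrix are made up for illustration, and it still assumes the search strings are lowercase. It is the same brute force, not a faster algorithm.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing self-overlap."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Step forward by one character only, so a match that
        # overlaps the previous one is still found.
        start = text.find(pattern, start + 1)
    return count


def match_matrix(sentences, pattern_lists):
    """Return one row per pattern list and one column per sentence."""
    lowered = [s.lower() for s in sentences]
    return [
        [sum(count_overlapping(s, p) for p in patterns) for s in lowered]
        for patterns in pattern_lists
    ]


sentence_list = ['Homegrown tech giant', 'GoTo gained 23 percent', 'at its Indonesia Stock Exchange']
list_of_string_list = [['homeg', 'goto'], ['to ga', 'gained', 'cents']]
print(match_matrix(sentence_list, list_of_string_list))
```

For the example data this gives the same 2x3 result as the loop above, since none of the search strings overlaps with itself there.
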
Tim Roberts