I want to use numpy to speed up my computation, wherein I have a dictionary and I want to create a vector from it based on the presence of words as keys in the dictionary. I currently do this (a dummy example is provided for better understanding; the actual data is much larger):

self.bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = ['a boy ran over a cat with his bike yesterday afternoon']
for sentence in sentences:
    feature_vector = []
    # generate pairs from this sentence and see if in bigram_freq
    bigram_pairs = self.retrieve_pairs(sentence)
    dummy_dict = dict.fromkeys(self.bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in self.bigram_freq:
            dummy_dict[pair] +=1
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)

But due to the two nested loops, it's quite slow. I was wondering if this could be sped up using numpy and np.where. I was thinking of creating an array of np.zeros and then populating a specific index of the ndarray when the corresponding token (a pair from bigram_pairs) is present, but I am unable to do so. Any help would be appreciated.

  • You should explain more about what `self.bigram_freq` and `self.retrieve_pairs` are for. Your script does not work when I run it. – mathfux Dec 02 '21 at 22:31
  • Okay, so self.bigram_freq is a dictionary that has a count of every bigram. For example {"a cat": 3, "man child": 2, "pokemon team": 4} – Abhiram Natarajan Dec 02 '21 at 22:43
  • So you should refactor your script in order to make a Minimal Working Example. This helps you to draw more attention from other users to your question. – mathfux Dec 02 '21 at 22:46

1 Answer


I have attempted to fix your script like this in order to make it work:

outputVector = []
bigram_freq = {"a cat": 3, "man child": 2, "pokemon team": 4}
sentences = ['a boy ran over a cat with his bike yesterday afternoon', 
             'he was dreaming about pokemon team at the moment he hit a cat']
S = sentences[-1].split(' ')
bigram_pairs = [f'{x} {y}' for x,y in zip(S[:-1], S[1:])] 
>>> bigram_pairs
['he was', 'was dreaming', 'dreaming about', 'about pokemon', 'pokemon team', 'team at', 'at the', 'the moment', 'moment he', 'he hit', 'hit a', 'a cat']

Now, in pure Python you do it like this:

for sentence in sentences:
    S = sentence.split(' ')
    bigram_pairs = [f'{x} {y}' for x,y in zip(S[:-1], S[1:])]
    dummy_dict = dict.fromkeys(bigram_freq, 0)
    for pair in bigram_pairs:
        if pair in bigram_freq:
            dummy_dict[pair] +=1
    feature_vector = list(dict(sorted(dummy_dict.items(), key=lambda item: item[0])).values())
    outputVector.append(feature_vector)        
>>> outputVector
[[1, 0, 0], [1, 0, 1]]
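
Incidentally, the np.zeros indexing idea from your question can be made to work like this (a sketch; `pair_index` is a helper name I am introducing, and the inner loop is still Python, so it mainly saves the per-sentence sorting):

import numpy as np

# Map each known bigram to a fixed position in the feature vector (sorted for a stable order).
pair_index = {pair: i for i, pair in enumerate(sorted(bigram_freq))}

outputVector = []
for sentence in sentences:
    S = sentence.split(' ')
    # Preallocate the vector and bump the count at the bigram's index.
    feature_vector = np.zeros(len(pair_index), dtype=int)
    for x, y in zip(S[:-1], S[1:]):
        i = pair_index.get(f'{x} {y}')
        if i is not None:
            feature_vector[i] += 1
    outputVector.append(feature_vector)

This keeps the counting semantics of your original loop and avoids re-sorting the dictionary on every sentence.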

And you want to make it faster. Now take a look at this question. You don't actually need to create a list of all the pairs, because numpy allows you to check whether a specific pair occurs in a sentence:

import numpy as np

outputVector = []
match_with = list(bigram_freq)
for sentence in sentences:
    feature_vector = np.core.defchararray.find(sentence, match_with)!=-1
    outputVector.append(feature_vector)
>>> outputVector
[array([ True, False, False]), array([ True, False,  True])]
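
If you need the 0/1 integer form instead of booleans, you can cast the whole thing at once (one option; np.where(feature_vector, 1, 0) per row would work too):

>>> np.array(outputVector).astype(int)
array([[1, 0, 0],
       [1, 0, 1]])

Keep in mind this checks substring presence only, not counts, so it can differ from the counting loop above.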
mathfux
  • Hi, thanks for your answer. For whatever reason, though, this seems to be a lot slower than before with np.core.defchararray in place. Note I additionally added an np.where to bring it to the 0/1 output form, and I don't think that should be a bottleneck. – Abhiram Natarajan Dec 03 '21 at 02:37
  • That is not the bottleneck, for sure. You could just use `np.array(outputVector).astype(int)`. Note that `numpy` is not designed for string operations, but there might be faster alternatives to `np.core.defchararray.find`. – mathfux Dec 03 '21 at 02:47