
I have a list of unique ngrams (a list called ngramlist) and an ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of entries in ngrams equal to the corresponding element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
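
For example, with made-up inputs (not my real data):

ngrams = ['a b', 'b c', 'a b', 'c d']     # ngram-tokenized text
ngramlist = ['a b', 'b c', 'c d', 'd e']  # unique ngrams

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
print(freqlist)  # [0.5, 0.25, 0.25, 0.0]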

I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.

Thanks!

Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.

Edit2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):

from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City','Moscow','Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()

data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))

df = pd.DataFrame(data)

df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])

ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist]))

df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])
– Ilya
  • Where did you get the function from? What is in the input `ngramlist`? What is the expected output? – alvas Apr 03 '18 at 02:31
  • What's the code before the line you've posted? – alvas Apr 03 '18 at 02:33
  • made an edit with the code to generate – Ilya Apr 03 '18 at 16:34
  • @alvas. What does it really matter? – Mad Physicist Apr 03 '18 at 16:44
  • @Ilya. Unfortunately, the code you showed does not really help us reproduce your actual data, so there is no purpose to it in the question. Fortunately though, I don't think you really need it in the first place. – Mad Physicist Apr 03 '18 at 16:45
  • @Mad Physicist I've updated the code for a fully reproducible version (didn't do this before since I was using a different dataset that I couldn't share, but now I just use wikipedia instead). I am pretty sure there is a faster way to make the df['freqlist'] and I want to find it. Anyway, I will continue googling and post the answer when I find it. Edit: the current approach I am using is excruciatingly slow and it is all in that one line. – Ilya Apr 03 '18 at 16:57
  • I'm almost done with an answer. – Mad Physicist Apr 03 '18 at 16:57
  • It's because initially, you're doing some weird things like using `ngrams` in your import namespace and using `len(ngrams)` like it's a variable. – alvas Apr 03 '18 at 17:00
  • Your last line of code is `O(mn)` where m is the no. of features in `ngramlist` and n is the no. of rows you have. You need to do some proper vectorization. – alvas Apr 03 '18 at 17:02
  • @alvas the namespace is not a problem (since it's `ngrams` vs `ngrams_`); in the original line I just removed the _ to avoid confusion. If you look at the reproducible code there is no collision. I know my last line is O(mn); I am wondering how to improve it. – Ilya Apr 03 '18 at 17:06
  • Yes, the namespace IS a problem. Writing clear code helps you think about the problem better and explains the question better. BTW, I'm writing an answer to address the rest of your code (it's a little problematic) =) – alvas Apr 03 '18 at 17:16
  • @alvas Ok thanks. The rest of the code is kind of thrown last minute to try to make a reproducible snippet but I am curious to hear what you think. – Ilya Apr 03 '18 at 17:28

2 Answers


You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]

You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorithm O(n + m) instead of the O(nm) one you have now, where n = len(ngrams) and m = len(ngramlist). Remember, shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]

To apply this to each row of the DataFrame, you would have to write a `def` function instead of a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
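
If you want to sanity-check the speedup, here is a rough benchmark on synthetic data (the vocabulary and sizes below are invented, not taken from the question):

import random
import timeit
from collections import Counter

# Synthetic stand-ins for the real data.
vocab = ['ngram_%d' % i for i in range(1000)]
ngrams_ = [random.choice(vocab) for _ in range(10000)]
ngramlist = list(set(ngrams_))

def original():
    return [sum(int(ngram == cand) for cand in ngrams_) / len(ngrams_)
            for ngram in ngramlist]

def with_counter():
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

assert original() == with_counter()   # same output either way
print(timeit.timeit(original, number=1))
print(timeit.timeit(with_counter, number=1))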
– Mad Physicist

Firstly, don't shadow your imported functions by reusing their names as variables: keep ngrams as the name of the function, and use something else for the variable.

import time
from functools import partial
from itertools import chain
from collections import Counter

import wikipedia

import pandas as pd

from nltk import word_tokenize
from nltk.util import ngrams

Next, the steps before the line you're asking about in the original question might be a little inefficient. You can clean them up, make them easier to read, and time each one:

# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')

And then:

# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')

Then:

# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')
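
Note that ngrams yields plain tuples, which are hashable, so you can count and compare them directly without joining them into strings. A tiny demonstration with made-up tokens:

from nltk.util import ngrams

tokens = ['New', 'York', 'City', 'is', 'big']
print(list(ngrams(tokens, 3)))
# [('New', 'York', 'City'), ('York', 'City', 'is'), ('City', 'is', 'big')]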

Finally, on to the last line:

# Instead of a set, we use a Counter here because 
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))

# More often than not, you don't need to keep all the
# zeros in the vectors (i.e. a dense vector);
# you can get the non-zero entries (a sparse vector)
# as a dict, like so:
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)

# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng:count/total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
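
To make the & step concrete: intersecting two Counters keeps the minimum of the two counts for each key and drops anything that falls to zero. A toy example:

from collections import Counter

a = Counter({'x': 3, 'y': 1})
b = Counter({'x': 1, 'z': 2})
print(a & b)  # Counter({'x': 1}): per-key minimum, zero counts dropped

Since all_trigrams counts every document, its counts are never smaller than a single document's, so the intersection in featurize simply restricts each document's Counter to trigrams seen in the corpus.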
– alvas
  • This addresses all the issues I left out. Very nice. – Mad Physicist Apr 03 '18 at 17:34
  • Thank you for the post. 1. yeah, the downloading articles thing I agree with, not sure why I wrote it the way I did, just wanted to output a snippet as fast as I could since people asked for it. 2. partial trigrams line actually doesn't do what the original line does since you need to do a ' '.join() on the tuple (because the real underlying code actually does something with the ngram strings). I guess I can use a lambda though. 3. The use of counter instead of list of set is interesting, are we guaranteed that the order is preserved for every application? – Ilya Apr 03 '18 at 17:36
  • If you're using ngram strings to do something and then vectorize, then you're doing it inefficiently. What truly matters at the end is the vector values (dense/sparse), the feature (key) itself, the machine learning algorithm don't really care as long as it's consistent. – alvas Apr 03 '18 at 17:38
  • @alvas I basically find intersects between sets of ngrams. As far as I know you can't do an intersect between two sets of sets (is that correct?), so I join() so I can compute intersects between sets of strings. To clarify, I have several labels and I want to only consider ngrams that occur in both label classes a sufficient amount of times. – Ilya Apr 03 '18 at 17:40
  • Yes, you can do intersection between sets =) And in fact the Pythonic syntax very "humanly-readable" `set(x).intersection(set(y))` – alvas Apr 03 '18 at 17:41
  • I understand the line above, this is what I do where x is a set of ngrams for label = 0 and y is a set of ngrams for label = 1. But I was under impression that x and y cannot be a set of sets (which is why I simply made each ngram set into a string with join). – Ilya Apr 03 '18 at 17:43
  • IMHO, don't trust strings, strings are a higher-level construct; trust arrays. A string is supposed to be an array of chars ;P – alvas Apr 03 '18 at 17:46
  • Ok, my bad. ngrams are tuples anyway, which are hashable. Sets of sets are unhashable though, but I actually care about order, so tuples are fine. – Ilya Apr 03 '18 at 17:47
  • Hope the answer helps. – alvas Apr 03 '18 at 17:49
  • @Ilya. A frozenset is quite hashable if that helps (for the same reason that a tuple is while list is not). – Mad Physicist Apr 03 '18 at 17:51
  • If I'm not wrong, any immutable container should be hashable ;P List and set aren't because they're mutable. Basic types (non-containers) are treated differently with regards to mutability and hashability. – alvas Apr 03 '18 at 17:52
  • @alvas, ok so doing some timing, I think the original snippet + Mad Physicist's final line is about 1 second faster than your code. Part of the problem with testing this is that the download speed actually varies quite a bit (3 to 4 s), but I think on average both download snippets take the same time to execute. – Ilya Apr 03 '18 at 18:10
  • Don't time the whole code. Benchmark each component. – alvas Apr 03 '18 at 18:33
  • And BTW, you don't need the intermediate step of `df['trigrams_count'] = ...` if you only need the normalized version =) – alvas Apr 03 '18 at 18:34
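
To illustrate the hashability point from the comments above, a quick sketch:

# Tuples and frozensets are immutable, hence hashable; lists and sets are not.
print(hash(('a', 'b', 'c')))        # works
print(hash(frozenset({'a', 'b'})))  # works
try:
    hash(['a', 'b'])
except TypeError as e:
    print(e)  # unhashable type: 'list'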