I want to compute N-grams over a set of strings in MXNet. Preferably, I would do something like TF-IDF vectorizing, but even a simple N-gram count with minimum-count and feature limits would be fine. Is there a built-in function for this? What would be the best approach?
Currently, I am computing it in plain Python:
def tfidf(str_list, ngram_width=3):
    # Term frequency: count every character n-gram across all strings.
    tf = {}
    for s in str_list:
        # The + 1 ensures the final n-gram of each string is included.
        for start in range(len(s) - ngram_width + 1):
            gram = s[start:start + ngram_width]
            tf[gram] = tf.get(gram, 0) + 1
    # Inverse document frequency (no log): N / (strings containing the n-gram + 1).
    idf = {}
    for t in tf:
        cnt = sum(1 for s in str_list if t in s)
        idf[t] = len(str_list) / (cnt + 1.0)
    return {t: tf[t] * idf[t] for t in tf}
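For comparison, the simpler variant mentioned in the question (plain n-gram counts with a minimum-count and a feature limit) might look roughly like the sketch below. It is pure Python and does not use any MXNet API; the names `build_vocab`, `vectorize`, `min_count`, and `max_features` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_vocab(str_list, ngram_width=3, min_count=1, max_features=None):
    """Map the most frequent character n-grams to column indices (hypothetical helper)."""
    counts = Counter()
    for s in str_list:
        for start in range(len(s) - ngram_width + 1):
            counts[s[start:start + ngram_width]] += 1
    # Count limit: drop n-grams seen fewer than min_count times.
    kept = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Feature limit: keep the most frequent n-grams, ties broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if max_features is not None:
        kept = kept[:max_features]
    return {gram: idx for idx, (gram, _) in enumerate(kept)}


def vectorize(str_list, vocab, ngram_width=3):
    """Return one row of n-gram counts per string, with columns ordered by vocab."""
    rows = []
    for s in str_list:
        row = [0] * len(vocab)
        for start in range(len(s) - ngram_width + 1):
            idx = vocab.get(s[start:start + ngram_width])
            if idx is not None:  # n-grams outside the vocabulary are ignored
                row[idx] += 1
        rows.append(row)
    return rows
```

The nested list returned by `vectorize` is rectangular, so it should be possible to hand it to `mx.nd.array(rows)` to get an NDArray for further processing in MXNet.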