Python and nGrams

Question

Aster user here, trying to move completely over to Python for basic text analytics. I am trying to replicate the output of Aster ngram in Python using nltk or some other module. I need to be able to do this for n-grams of 1 through 4 words, with the output written to CSV.

DATA:

Unique_ID, Text_Narrative

OUTPUT NEEDED:

Unique_id, ngram(token), ngram(frequency)

Example output:

023345 "I" 1
023345 "Love" 1
023345 "Python" 1

Hi, welcome to SO, can you include some code of what you attempted? What is the main issue? — Jan Sila, Aug 14 '17 at 15:19
We aren't a coding service. Please show us what you have done and where you are stuck. — Error - Syntactical Remorse, Aug 14 '17 at 15:21
You need a couple of things: `open` or `csv.writer` for the file writing, then I would recommend `Counter` from `collections`, and that's pretty much it. Do you want the frequency within each unique_ID string or across all of them? — Jan Sila, Aug 14 '17 at 15:24
Apologies, as I am new to Python and open source in general. In my research, I've discovered several different ways to do ngrams in Python. So my question is: which method would you recommend to mimic the ASTER output (if you're familiar with ASTER)? — Josh Chilton, Aug 14 '17 at 16:30

Uri Goren · Accepted Answer · 2017-08-14T17:50:20.737

I wrote this simple version using only Python's standard library, for educational purposes.

Production code should use spaCy and pandas.

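import collections
from operator import itemgetter as at

# Each row is "unique_id,text"; split on the first comma only so that
# commas inside the text are kept. Assumes the file has no header row.
with open("input.csv", 'r') as f:
    data = [l.rstrip('\n').split(',', 1) for l in f if ',' in l]

# Join a window of (id, word) pairs into (id, "w1 w2 ..."), or return []
# when the window spans more than one unique_id.
spaced = lambda t: (t[0][0], ' '.join(map(at(1), t))) if all(p[0] == t[0][0] for p in t) else []
unigrams = [(i, w) for i, d in data for w in d.split()]
bigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:])))
trigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:], unigrams[2:])))

# Count every (id, ngram) pair and write one "id,ngram,count" line each
with open("output.csv", 'w') as f:
    for ngram in [unigrams, bigrams, trigrams]:
        counts = collections.Counter(ngram)
        for t, count in counts.items():
            f.write("{i},{w},{c}\n".format(c=count, i=t[0], w=t[1]))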
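
The question asks for n-grams of 1 through 4 words written to CSV, so here is a minimal standard-library sketch that generalizes the window size and uses the `csv` module, which quotes fields containing commas. The helper names `ngrams`, `count_ngrams`, and `write_counts`, the header row, and the whitespace tokenization are assumptions made for illustration; they are not a description of what Aster does.

```python
import csv
import io
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words, joined by spaces."""
    # Zipping n staggered slices yields each window of n consecutive words
    return [' '.join(window) for window in zip(*(words[k:] for k in range(n)))]


def count_ngrams(rows, max_n=4):
    """Count n-grams (n = 1..max_n) separately for each unique ID.

    `rows` is an iterable of (unique_id, text) pairs.
    Returns a Counter keyed by (unique_id, ngram).
    """
    counts = Counter()
    for unique_id, text in rows:
        words = text.split()  # naive whitespace tokenization (an assumption)
        for n in range(1, max_n + 1):
            counts.update((unique_id, gram) for gram in ngrams(words, n))
    return counts


def write_counts(counts, out_file):
    """Write unique_id, ngram, frequency rows with csv.writer."""
    writer = csv.writer(out_file)
    writer.writerow(['unique_id', 'ngram', 'frequency'])
    for (unique_id, gram), freq in counts.items():
        writer.writerow([unique_id, gram, freq])


# In-memory demonstration; for real files, swap the StringIO objects for
# open("input.csv", newline='') and open("output.csv", "w", newline='').
sample_input = io.StringIO('023345,I love Python\n')
rows = [(row[0], row[1]) for row in csv.reader(sample_input)]
counts = count_ngrams(rows)
output = io.StringIO()
write_counts(counts, output)
print(output.getvalue())
```

Because each narrative is windowed on its own, an n-gram can never span two IDs, and a text shorter than n words simply contributes no n-grams of that size.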

edited Aug 14 '17 at 17:50

answered Aug 14 '17 at 15:23

Uri Goren


Thanks Uri, this code has gotten me halfway there. Can you share the adjustment that would allow me to run an ngram of 2 words, 3 words, etc.? – Josh Chilton Aug 14 '17 at 16:48
I've added the bigram and trigram calculations. Please accept the answer if it was helpful. If you have any additional requests, please ask a new question. – Uri Goren Aug 14 '17 at 17:41

score 0 · Answer 2 · answered Nov 10 '17 at 23:32

As the others said, the question is really vague, but since you are new, here's a long-form guide. :-)

from collections import Counter

# Your starting input: a phrase with an ID.
# I added some extra words to show the counts.
dict1 = {'023345': 'I love Python love Python Python'}

# Split the dict value into a list of words for counting
dict1['023345'] = dict1['023345'].split()

# Use Counter to count the words
countlist = Counter(dict1['023345'])

# countlist is now "Counter({'Python': 3, 'love': 2, 'I': 1})"

# To output it the way you requested, iterate over the dict
for id1 in dict1:
    for word, count in countlist.items():
        print(id1, word, count)
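
The same idea extends to several IDs by keeping one `Counter` per ID. This is only a sketch: the second ID and its text are made-up sample data, and the names `narratives` and `counts_by_id` are chosen here for illustration.

```python
from collections import Counter

# Hypothetical sample data: two IDs, each with its own narrative
narratives = {
    '023345': 'I love Python love Python Python',
    '023346': 'Python is fun',
}

# Build one Counter per ID so counts never mix across narratives
counts_by_id = {uid: Counter(text.split()) for uid, text in narratives.items()}

# most_common() lists tokens from most to least frequent
for uid, counter in counts_by_id.items():
    for token, freq in counter.most_common():
        print(uid, token, freq)
```

Keeping the counters separate matters because the requested output is a frequency per Unique_id, not a total across the whole file.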
