Thank you in advance for your help. I'm trying to write a script that looks at a corpus, finds all trigrams, and prints them along with their relative frequencies to a CSV file. I have gotten pretty far but keep running into one problem: the tokenizer treats contractions as two words because of the apostrophe, so it splits "doesn't" into "doesn" and "t", which messes up the trigram count. I am trying to solve that by removing all punctuation from the raw variable, which I believe is just one long string containing all of the text from my corpus, with this line:

    raw = raw.translate(None, string.punctuation)

But that gives me an error that says: NameError: name 'string' is not defined

But I didn't think string had to be defined when used like that? Does that mean raw is not a string? How can I solve this?

    # this imports the text files in the folder into a corpus called speeches
    corpus_root = '/Users/root'
    speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
    print "Finished importing corpus"
    tokenizer = RegexpTokenizer(r'\w+')
    raw = speeches.raw().lower()
    raw = raw.translate(None, string.punctuation)
    finalwords = raw.encode('ascii', 'xmlcharrefreplace')
    tokens = tokenizer.tokenize(finalwords)
    tgs = nltk.trigrams(tokens)
    fdist = nltk.FreqDist(tgs)
    minscore = 40
    numwords = len(finalwords)
    print "Words in corpus:"
    print numwords
    c = csv.writer(open("TPNngrams.csv", "wb"))
    for k, v in fdist.items():
        if v > minscore:
            rf = Decimal(v) / Decimal(numwords)
            firstword, secondword, thirdword = k
            trigram = firstword + " " + secondword + " " + thirdword
            results = trigram, v, rf
            c.writerow(results)
            print firstword, secondword, thirdword, v, rf

    print "All done."
lennon310
Jolijt Tamanaha

3 Answers

But I didn't think string had to be defined when used like that?

Like any other module in Python, string must be imported before it is used.

Does that mean raw is not a string?

No. Do not confuse the string module with the string type. raw is most likely a string; the NameError is about the name string, which refers to the module and has not been imported.

How can I solve this?

Add import string at the beginning of the file.
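
As a minimal sketch of the fix, the snippet below imports the string module and strips punctuation from a sample sentence. It is written for Python 3, where str.translate takes a single table built with str.maketrans; the two-argument form raw.translate(None, string.punctuation) from the question only works on Python 2 byte strings. The helper name strip_punctuation and the sample text are made up for illustration.

```python
import string


def strip_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    # str.maketrans("", "", chars) builds a table that deletes those characters.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


raw = "It doesn't work, does it?"
print(strip_punctuation(raw.lower()))  # it doesnt work does it
```

Note that this turns "doesn't" into "doesnt", so the contraction survives as one token but loses its apostrophe.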

dreyescat

Another option, if you want to keep the apostrophes in the words:

You don't necessarily have to strip the apostrophes out. Just change the regular expression in your tokenizer to include apostrophes. Instead of:

    tokenizer = RegexpTokenizer(r'\w+')

try:

    tokenizer = RegexpTokenizer(r"[\w']+")

Also take a look at this answer, which might be better:

Regex to match words and those with an apostrophe
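
To see what an apostrophe-aware pattern does to the trigram count, here is a small sketch that uses only the standard library re module instead of NLTK, so it runs without a corpus. The helper names tokenize and trigrams and the sample sentence are hypothetical; the assumption is that the same pattern, a character class with no capturing group, behaves the same way when passed to RegexpTokenizer.

```python
import re

# Runs of word characters and apostrophes; a character class avoids capturing groups.
WORD_PATTERN = re.compile(r"[\w']+")


def tokenize(text):
    """Lowercase the text and return its word tokens, keeping apostrophes."""
    return WORD_PATTERN.findall(text.lower())


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


tokens = tokenize("It doesn't work, does it?")
print(tokens)            # ['it', "doesn't", 'work', 'does', 'it']
print(trigrams(tokens))  # 3 trigrams, with "doesn't" kept as a single token
```

If you go this route, drop the line that removes punctuation before tokenizing, otherwise the apostrophes are gone before the pattern sees them. The pattern only matches the straight apostrophe ('), not typographic ones, and it also keeps quote marks that sit directly against a word.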

ZzCalvinzZ
  • One more question, do you happen to know how to write a regular expression that would just tokenize by whitespace? I can't figure it out – Jolijt Tamanaha Oct 16 '14 at 22:06
  • Do you mean you want the regex to match anything until it hits whitespace? Note that "^" anchors the match to the start of the string, so ^\S* only matches the first run of non-whitespace characters. To tokenize on whitespace, match runs of non-whitespace with \S+ instead. – ZzCalvinzZ Oct 17 '14 at 18:39
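
For the whitespace question in the comments, a short sketch with the standard library: \S+ matches each run of non-whitespace characters, which gives the same result as str.split() with no arguments. The function name whitespace_tokens and the sample text are made up for illustration.

```python
import re


def whitespace_tokens(text):
    """Split text on whitespace by matching runs of non-whitespace characters."""
    return re.findall(r"\S+", text)


text = "It doesn't work,  does it?"
print(whitespace_tokens(text))  # ['It', "doesn't", 'work,', 'does', 'it?']
print(text.split())             # same list, without a regex
```

Punctuation stays attached to the neighbouring word ("work," and "it?"), so this is only useful if that is acceptable for the trigram counts.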

If you would like to use punctuation, you need to import it first. Either form works in both Python 2 and Python 3.

To refer to it as punctuation:

    from string import punctuation

To refer to it as string.punctuation:

    import string

This link may help as well:

https://www.geeksforgeeks.org/string-punctuation-in-python/

  • This is a duplicate of this answer (https://stackoverflow.com/a/26412057/16775594). Please don't post duplicate answers. See "How to Answer" for more information. – Sylvester Kruin Mar 30 '22 at 21:50