Error when stripping punctuation from corpus

Question

Thank you in advance for your help. I'm trying to write a script that will look at a corpus, find all trigrams, and print them along with their relative frequencies to a CSV file. I have gotten pretty far but keep running into one problem: the tokenizer treats contractions as two words because of the apostrophe, so it splits "doesn't" into "doesn" and "t", which messes up the trigram count. I am trying to solve that by removing all punctuation from the raw variable, which I believe is just one long string containing all of the text from my corpus, with this line:

    raw = raw.translate(None, string.punctuation)

But that gives me an error that says: NameError: name 'string' is not defined

But I didn't think string had to be defined when used like that? Does that mean raw is not a string? How can I solve this?

    #this imports the text files in the folder into corpus called speeches
    corpus_root = '/Users/root'
    speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
    print "Finished importing corpus"
    tokenizer = RegexpTokenizer(r'\w+')
    raw = speeches.raw().lower()
    raw = raw.translate(None, string.punctuation)
    finalwords = raw.encode['ascii','xmlcharrefreplace']
    tokens = tokenizer.tokenize(finalwords)
    tgs = nltk.trigrams(tokens)
    fdist = nltk.FreqDist(tgs)
    minscore = 40
    numwords = len(finalwords)
    print "Words in corpus:"
    print numwords
    c = csv.writer(open("TPNngrams.csv", "wb"))
    for k,v in fdist.items():
        if v > minscore:
            rf = Decimal(v)/Decimal(numwords)
            firstword, secondword, thirdword = k
            trigram = firstword + " " + secondword + " " + thirdword
            results = trigram,v,rf
            c.writerow(results)
            print firstword, secondword, thirdword, v, rf

    print "All done."

Oops! Thank you. I did that, but now I'm getting the following error: TypeError: translate() takes exactly one argument (2 given) – Jolijt Tamanaha, Oct 16 '14 at 19:06
Because you need to use either the `string.translate` module function, which takes two parameters, or the `translate` method of the string, which takes one. Simply remove the `None`. – dreyescat, Oct 16 '14 at 19:14
I did that, and now it says: TypeError: character mapping must return integer, None or unicode – Jolijt Tamanaha, Oct 16 '14 at 19:17

score 0 · Answer 1 · answered Oct 16 '14 at 19:08

> But I didn't think string had to be defined when used like that?

Like any other module in Python, string must be imported before it is used.

> Does that mean raw is not a string?

Do not confuse the string module with the string type; raw is most likely a string.

> How can I solve this?

Add `import string` at the beginning of the file.
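One follow-up, in case the import alone isn't enough: the two-argument form `raw.translate(None, string.punctuation)` only works on Python 2 byte strings. If `raw` is a unicode string (an assumption here, but it would explain a `TypeError` about the number of arguments), `translate` takes a single mapping from code points to replacements, and mapping a code point to `None` deletes it. A minimal sketch, with a made-up sample sentence standing in for the corpus text:

```python
import string

# Hypothetical sample text standing in for speeches.raw().lower()
raw = u"it doesn't matter, does it? no: it doesn't!"

# Map every punctuation code point to None so that translate() deletes it.
delete_punctuation = {ord(ch): None for ch in string.punctuation}
cleaned = raw.translate(delete_punctuation)

print(cleaned)
```

Note that this removes apostrophes too, so "doesn't" becomes "doesnt" instead of being split into two tokens.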

score 0 · Accepted Answer · edited May 23 '17 at 10:33

Another option, if you want to keep the apostrophes in the words:

You don't necessarily have to split the apostrophes out. Try changing the regular expression in your tokenizer to include apostrophes. Instead of

    tokenizer = RegexpTokenizer(r'\w+')

try:

    tokenizer = RegexpTokenizer(r"[\w']+")

Also take a look at this related question, which might offer a better pattern:

Regex to match words and those with an apostrophe
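To see the difference without needing NLTK installed, the two patterns can be compared with the standard `re` module. This is only a rough stand-in: as far as I know, `RegexpTokenizer` applies its pattern in a `findall`-like way when `gaps` is left at its default, but treat that as an assumption. The sample sentence is made up.

```python
import re

# Hypothetical sample sentence containing contractions
text = "it doesn't matter what he doesn't know"

# \w+ stops at the apostrophe, so each contraction becomes two tokens
plain_tokens = re.findall(r"\w+", text)

# A character class that also allows apostrophes keeps contractions whole
apostrophe_tokens = re.findall(r"[\w']+", text)

print(plain_tokens)
print(apostrophe_tokens)
```

With the second pattern the contraction stays a single token, so the trigram counts are no longer inflated by stray "t" tokens.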

answered Oct 16 '14 at 19:13 · ZzCalvinzZ · edited May 23 '17 at 10:33 · Community

One more question, do you happen to know how to write a regular expression that would just tokenize by whitespace? I can't figure it out – Jolijt Tamanaha Oct 16 '14 at 22:06
Do you mean you want the regex to match everything up to the first whitespace? In that case you can anchor the pattern at the start of the string with "^": the expression ^\S* matches the leading run of non-whitespace characters. – ZzCalvinzZ Oct 17 '14 at 18:39
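If the goal is to split the whole text on whitespace rather than grab only the first word, an unanchored `\S+` (or plain `str.split()`) is probably closer to what was asked. A small sketch with a made-up sample string, using only the standard library:

```python
import re

# Hypothetical sample text with mixed whitespace (spaces, tab, newline)
text = "it doesn't  matter,\tdoes it?\n"

# Anchored at the start: only the first run of non-whitespace is found
first_only = re.findall(r"^\S*", text)

# Unanchored: every run of non-whitespace characters becomes a token
all_tokens = re.findall(r"\S+", text)

# str.split() with no arguments splits on any run of whitespace
split_tokens = text.split()

print(first_only)
print(all_tokens)
print(split_tokens)
```

In NLTK itself, I believe the equivalents are `RegexpTokenizer(r'\s+', gaps=True)` or `WhitespaceTokenizer()`. Punctuation stays attached to the words either way.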

Ami hajimohammadi · Answer 3 · 2022-03-28T23:19:01.683

If you would like to use punctuation, you need to import it first. Either form below works in both Python 2 and Python 3:

    from string import punctuation

    import string

With the second form, refer to it as string.punctuation.

This link may help as well:

https://www.geeksforgeeks.org/string-punctuation-in-python/
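Once `punctuation` is imported, one common way to strip it on Python 3 is a translation table built with `str.maketrans`. A minimal sketch, assuming Python 3 and a made-up sample sentence (this three-argument `str.maketrans` call does not exist on Python 2):

```python
from string import punctuation

# Hypothetical sample sentence
text = "Hello, world! It's 5 o'clock."

# The third argument lists the characters to delete during translation
table = str.maketrans("", "", punctuation)
stripped = text.translate(table)

print(stripped)
```

Apostrophes are part of `punctuation`, so they are removed along with everything else.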

answered Mar 28 '22 at 22:49 · Ami hajimohammadi · edited Mar 28 '22 at 23:19

This is a duplicate of [this answer](https://stackoverflow.com/a/26412057/16775594). Please don't post duplicate answers. See [answer] for more information. – Sylvester Kruin Mar 30 '22 at 21:50
