
I am quite new to Python, which I am trying to learn for basic text analysis, topic modelling, etc.

I wrote the following code to clean my text file. I prefer the lemmatize_sentence() function from pywsd.utils to NLTK's WordNetLemmatizer() because it produces cleaner text. The following code works fine with sentences:

from nltk.corpus import stopwords
from pywsd.utils import lemmatize_sentence
import string

s = "Dew drops fall from the leaves. Mary leaves the room. It's completed. Hello. This is trial. We went home. It was easier. We drank tea. These are Demo Texts. Right?"

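# lemmatize_sentence() returns a list of lemmas
lemm = lemmatize_sentence(s)
print(lemm)

# Drop English stopwords and ASCII punctuation
stopword = stopwords.words('english') + list(string.punctuation)
removingstopwords = [word for word in lemm if word not in stopword]
# Use a context manager so the output file is closed properly
with open("cleaned.txt", "a") as fout:
    print(removingstopwords, file=fout)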

What I fail to do is lemmatize a raw text file in a directory. I guess lemmatize_sentence() only accepts strings?

I manage to read the contents of a file with

with open('a.txt', "r", encoding="utf-8") as fin:
    lemm = lemmatize_sentence(fin.read())
print(lemm)

but this time the code fails to remove some tokens such as "n't", "'ll", "'s", or "‘", as well as punctuation, which results in uncleaned text.

1) What am I doing wrong? Should I tokenize first? (I also failed to feed the tokenized results to lemmatize_sentence().)

2) How do I get the output file content without any formatting (words without single quotes and brackets)?

Any help is greatly appreciated. Thanks in advance.

1 Answer


Simply apply lemmatize_sentence() to each line, one at a time, and append the result to a temporary string followed by a newline. This does the same thing as before, except that it processes the file line by line. Once the loop finishes, the temporary string holds the final output, which you can print or write out.

my_temp_string = ""
with open ('a.txt',"r+", encoding="utf-8") as fin:
    for line in fin:
        lemm = lemmatize_sentence(line)
        my_temp_string += f'{lemm} \n'
print (my_temp_string)
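
The loop above still leaves tokens such as "n't" or "‘" in the result and prints each list with quotes and brackets. A minimal sketch of one way to handle both follows. It assumes that those tokens are simply missing from the stopword and punctuation lists, so it adds them by hand. EXTRA_TOKENS, clean_line and clean_file are hypothetical names introduced for illustration, not part of pywsd or NLTK, and the lemmatizer and stopword list are passed in as arguments so that the sketch does not depend on either library.

```python
import string

# Assumption: these tokens are not covered by the stopword list or by
# string.punctuation (which is ASCII only), so they are listed by hand.
EXTRA_TOKENS = {
    "n't", "'ll", "'s", "'re", "'ve", "'d", "'m",
    "‘", "’", "“", "”", "``", "''", "...",
}


def clean_line(line, lemmatize, stopwords):
    """Lemmatize one line, then drop stopwords, punctuation and extra tokens."""
    blocked = set(stopwords) | set(string.punctuation) | EXTRA_TOKENS
    lemmas = lemmatize(line)  # expected to return a list of strings
    return [word for word in lemmas if word.lower() not in blocked]


def clean_file(src, dst, lemmatize, stopwords):
    """Clean src line by line and write plain space-separated words to dst."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            words = clean_line(line, lemmatize, stopwords)
            if words:
                # " ".join() writes the words without quotes or brackets
                fout.write(" ".join(words) + "\n")
```

With the imports from the question, this could be called as clean_file("a.txt", "cleaned.txt", lemmatize_sentence, stopwords.words("english")). Extend EXTRA_TOKENS with whatever leftovers still appear in your own output.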
DUDANF
  • Please, consider adding a brief explanation to your answer so that the question author can better understand what you did – Vasilisa Nov 11 '19 at 12:24
  • Thanks, but this did not solve my problem. It gives an error: Traceback (most recent call last): File "", line 4, in AttributeError: 'str' object has no attribute 'read'. In any case, I had already managed to read the content; that is not the problem. My problem is with the second part of the code: I cannot remove the stopwords and punctuation. – Emrah Peksoy Nov 11 '19 at 12:57
  • Edited my answer, try now. Remove the `.read()` from `lemmatize_sentence(line.read())` – DUDANF Nov 11 '19 at 13:39
  • Thanks so much. I got the idea. It is fine now. – Emrah Peksoy Dec 16 '19 at 11:12