
I am new to Python programming. Right now I am doing natural language processing on text files. The problem is that I have around 200 text files, so it is very difficult to load each file individually and apply the same method.

Here's my program:

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
with open("c:/users/user/desktop/datascience/sotu/a41.txt", 'r') as a411:
    a41 = a411.read()
    a41c = word_tokenize(str(a41))
    a41c = [w for w in a41c if not w in sw]

So I want to apply this method to multiple files. Is there a way I can load all the files in one step and apply the same method? I tried this, but it did not work:

import os
import glob
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename=word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]
xqc=FreqDist(filename)

Please help.

Learner27

1 Answer


First and foremost, the second method does not work because you are not actually loading the files you wish to inspect. In the first (presumably working) example you call word_tokenize on a string representing a file's contents; in the second you call it on the filename. Note that your code is really unclear here:

for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename = word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]

Do not use filename three times in three lines! The first represents the filename itself, the second a tokenized word list, and the third the same word list, filtered!

As another hint, try giving your variables more descriptive names. I am not familiar with NLP, but someone looking through your code might want to know what xqc means.

Here's a snippet from which I hope you can work out how to adapt this to your own code.

import os
from nltk.tokenize import word_tokenize

stopwords_filename = "words.txt"
stop_words = []
with open(stopwords_filename, "r") as stopwords_file:
    stop_words = stopwords_file.read().split()

words_input_dir = "c:/users/user/desktop/DataScience/sotu/"

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        # os.listdir gives bare names, so join them with the directory.
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            input_tokens = word_tokenize(input_file.read())
            # Do everything else.
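If it helps, here is a runnable sketch of the "everything else" part: filtering stop words and keeping one frequency count per file. It uses a plain str.split and collections.Counter as stand-ins for NLTK's word_tokenize and FreqDist so it runs without NLTK's data files, and it writes two small sample files to a temporary directory; the names and sample text are illustrative, not from your data.

```python
import os
import tempfile
from collections import Counter

# Stand-in tokenizer so the sketch runs without NLTK's data files;
# in your code this would be nltk.tokenize.word_tokenize.
def word_tokenize(text):
    return text.split()

stop_words = {"the", "of", "and"}  # illustrative; load yours from stopwords.txt

# Illustrative input directory with two small sample files.
words_input_dir = tempfile.mkdtemp()
with open(os.path.join(words_input_dir, "a41.txt"), "w") as f:
    f.write("the state of the union")
with open(os.path.join(words_input_dir, "a42.txt"), "w") as f:
    f.write("union and liberty")

freq_by_file = {}
for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        path = os.path.join(words_input_dir, filename)  # full path, not just the name
        with open(path, "r") as input_file:
            input_tokens = word_tokenize(input_file.read())
            filtered_tokens = [w for w in input_tokens if w not in stop_words]
            # One Counter (FreqDist-like) per file, keyed by filename.
            freq_by_file[filename] = Counter(filtered_tokens)
```

Keeping the counts in a dict keyed by filename means you still know which of your 200 files each distribution came from, instead of overwriting one variable on every loop iteration.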
Viktor Chynarov