
I am new to Python programming. Right now I am doing natural language processing on text files. The problem is that I have around 200 text files, so it is very difficult to load each file individually and apply the same method.

Here's my program:

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
with open("c:/users/user/desktop/datascience/sotu/a41.txt", 'r') as a411:
    a41 = a411.read()
    a41c = word_tokenize(str(a41))
    a41c = [w for w in a41c if not w in sw]

So I want to apply this method to multiple files. Is there a way I can load all the files in one step and apply the same method? I tried this, but it did not work:

import os
import glob
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename=word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]
xqc=FreqDist(filename)

Please help.

Learner27

1 Answer


First and foremost, the second method does not work because you are not actually loading the files you wish to inspect. In the first (presumably working) example you call word_tokenize on a string representing a file's contents; in the second you call it on the filename. Note that your code is really unclear here:

for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename = word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]

Do not use filename three times in three lines! The first represents the filename itself, the second a tokenized word list, and the third the same word list, filtered!

As another hint, try giving your variables more descriptive names. I am not familiar with NLP, but someone looking through your code might want to know what xqc means.

Here's a snippet from which I hope you can work out how to adapt this to your own code.

import os
from nltk.tokenize import word_tokenize

stopwords_filename = "words.txt"
stop_words = []
with open(stopwords_filename, "r") as stopwords_file:
    stop_words = stopwords_file.read().split()

words_input_dir = "c:/users/user/desktop/DataScience/sotu/"

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        # os.listdir gives bare names, so join them with the directory.
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            input_tokens = word_tokenize(input_file.read())
            # Do everything else.
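If it helps, here is a runnable sketch of the "everything else" part: filtering stop words and keeping one frequency count per file. It uses a plain str.split and collections.Counter as stand-ins for NLTK's word_tokenize and FreqDist so it runs without NLTK's data files, and it writes two small sample files to a temporary directory; the names and sample text are illustrative, not from your data.

```python
import os
import tempfile
from collections import Counter

# Stand-in tokenizer so the sketch runs without NLTK's data files;
# in your code this would be nltk.tokenize.word_tokenize.
def word_tokenize(text):
    return text.split()

stop_words = {"the", "of", "and"}  # illustrative; load yours from stopwords.txt

# Illustrative input directory with two small sample files.
words_input_dir = tempfile.mkdtemp()
with open(os.path.join(words_input_dir, "a41.txt"), "w") as f:
    f.write("the state of the union")
with open(os.path.join(words_input_dir, "a42.txt"), "w") as f:
    f.write("union and liberty")

freq_by_file = {}
for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        path = os.path.join(words_input_dir, filename)  # full path, not just the name
        with open(path, "r") as input_file:
            input_tokens = word_tokenize(input_file.read())
            filtered_tokens = [w for w in input_tokens if w not in stop_words]
            # One Counter (FreqDist-like) per file, keyed by filename.
            freq_by_file[filename] = Counter(filtered_tokens)
```

Keeping the counts in a dict keyed by filename means you still know which of your 200 files each distribution came from, instead of overwriting one variable on every loop iteration.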
Viktor Chynarov