I have a bunch of .txt files in a folder. Here are two functions that I use to read these files and save their contents into a variable as one string:

import glob

s = glob.glob("/Users/user/documents/folder/*.txt")

def read_files(files):
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()

def read_files_as_string(files, separator='\n'):
    files_content = list(read_files(files=files))
    return separator.join(files_content)

results=read_files_as_string(s)

Now my idea is to use sklearn's CountVectorizer() to get n-grams from the text. But CountVectorizer() does not accept a single string as input. So my question is: how can I change the file-reading function so that, instead of storing everything in one string, it stores the files in a list like this: ['text1.txt', 'text2.txt', ..., 'textn.txt']

Thanks in advance!

Keithx
  • Have I understood correctly that you want the result to be like `["contents of text1.txt", "contents of text2.txt", …]`, not the filenames as your question shows? – Aankhen Jul 09 '18 at 11:13
  • Fully correct: not the names but the contents, as you mentioned: ["contents of text1.txt", "contents of text2.txt", …] – Keithx Jul 09 '18 at 13:51

1 Answer

read_files already does almost all of what you want. You can call it directly and use list to convert it from a generator into a regular list:

results = list(read_files(s))
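
To see the effect end to end, here is a minimal sketch that reuses read_files from the question unchanged. The temporary folder, the two file names, and their contents are hypothetical sample data made up for illustration. The sorted() call is also an addition, since glob does not promise any particular order.

```python
import glob
import os
import tempfile


def read_files(files):
    # Yield the contents of each file, one string per file.
    for filename in files:
        with open(filename, 'r', encoding='latin-1') as file:
            yield file.read()


# Hypothetical sample data: two small files in a temporary folder.
folder = tempfile.mkdtemp()
samples = [('text1.txt', 'first document'), ('text2.txt', 'second document')]
for name, text in samples:
    with open(os.path.join(folder, name), 'w', encoding='latin-1') as f:
        f.write(text)

# sorted() keeps the order deterministic; glob itself does not guarantee one.
s = sorted(glob.glob(os.path.join(folder, '*.txt')))

# list() consumes the generator and keeps one string per file.
results = list(read_files(s))
print(results)
```

Each element of results is the full text of one file, which should be the shape CountVectorizer's fit_transform expects: an iterable of document strings rather than a single string.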
Aankhen