
I have a folder full of subfolders with text (.txt) files that look like this:

some random information here
ignore it

author: Lisa Smith

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl.

Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue.

Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam.

I want to create a CSV file that looks like this:

filename | author | text
/fullfilepathhere/ | Lisa Smith | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl. Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue. Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam.

This is the code I currently have, which I cobbled together from a previous question and several other posts:

from glob import glob
import os
import re
import csv
import nltk

path = '**/*.txt'

def extract_fields(fname):
    with open(fname) as f:
        author, txt = "", ""

        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break

        next(f)  # discard the following blank line

        txt = f.read()

        return author, txt


rows = []
for fname in glob(path):
    author, txt = extract_fields(fname)
    rows.append([fname, author, txt])

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "author", "txt"])
    writer.writerows(rows)

I am getting the following error:

Traceback (most recent call last):
  File "print_text.py", line 28, in <module>
    author, txt = extract_fields(fname)
  File "print_text.py", line 19, in extract_fields
    next(f)  # discard the following blank line
StopIteration

Any guidance would be appreciated!

hy9fesh

1 Answer


The biggest issue I see when I look at your code is structural; there may be a problem with the regex, but where does text come from, and how do you iterate over it?

I suggest you write a function that takes a filename and returns the extracted author and text. The main body of your script would then look like this:

def extract_fields(fname):
    ...
    return author, txt


rows = []
for fname in glob(...):
    author, txt = extract_fields(fname)
    rows.append([fname, author, txt])


with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "author", "txt"])
    writer.writerows(rows)

Inside extract_fields, you'll open the file at fname, perform the extraction, and return the extracted author and text.
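One thing worth checking in the glob(...) call: with a pattern such as '**/*.txt' (the one used in the question), glob only descends through nested subfolders when recursive=True is passed. Without it, ** behaves like a plain * and matches a single directory level. A minimal sketch, where find_txt_files and its root argument are hypothetical names used only for illustration:

```python
import os
import tempfile
from glob import glob


def find_txt_files(root):
    """Return every .txt file under root, at any depth (hypothetical helper)."""
    pattern = os.path.join(root, "**", "*.txt")
    # recursive=True makes "**" match zero or more directory levels.
    return sorted(glob(pattern, recursive=True))


# Small self-contained demo using a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    samples = ("top.txt", os.path.join("a", "one.txt"), os.path.join("a", "b", "two.txt"))
    for rel in samples:
        with open(os.path.join(root, rel), "w") as f:
            f.write("author: Doug\n\ntext\n")
    found = [os.path.relpath(p, root) for p in find_txt_files(root)]

print(found)
```

If the subfolders are only ever one level deep, the non-recursive pattern finds the same files; the flag matters once folders are nested further.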

If you know your regex is good and you like it, you can ignore the rest.

As for the mechanics of extracting: even for something that looks this simple, I'd prefer to work with the data as individual lines of text (I don't like multiline regexes and try to avoid them).

I'd iterate the first few lines of the file till I found the anchor line, "author: ...". Once I've identified that line, I know how to get the author's name:

for line in f:
    line = line.strip()
    if line.startswith("author: "):
        author = line[8:]
        break

That loop will read-and-discard (stripped) lines till it finds a line starting with "author: ", extract the author's name, then break out of the loop.
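The 8 in line[8:] is the length of the "author: " prefix. As a small sketch of the same loop without the hard-coded number, the slice can be derived from the prefix itself. PREFIX and find_author are hypothetical names, and io.StringIO stands in for an open file so the snippet runs on its own:

```python
import io

PREFIX = "author: "


def find_author(lines):
    """Return the author from the first line starting with PREFIX, or "" if none."""
    for line in lines:
        line = line.strip()
        if line.startswith(PREFIX):
            # len(PREFIX) is 8, so this is the same slice as line[8:].
            return line[len(PREFIX):]
    return ""


sample = io.StringIO(
    "some random information here\nignore it\n\nauthor: Lisa Smith\n\nLorem ipsum\n"
)
author = find_author(sample)
rest = sample.read()  # whatever follows the author line, blank line included

print(repr(author))
print(repr(rest))
```

Returning "" when no line matches mirrors the empty default used for author in the complete function.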

Out of the loop, I know the next line is a blank line and can be discarded:

next(f)  # discard the following blank line

and the rest is the text I want:

txt = f.read()

Here's the complete function:

def extract_fields(fname):
    with open(fname) as f:
        author, txt = "", ""

        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break

        next(f)  # discard the following blank line

        txt = f.read()

        return author, txt

With these two files:

file1.txt
=========
some random information here
ignore it

author: Lisa Smith

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

file2.txt
=========
foo

bar

author: Doug

Nulla sit amet leo quis risus viverra varius pretium sed nunc.
Nullam vitae tempor nisl.

I get this output.csv:

+-----------+------------+----------------------------------------------------------------+
| filename  | author     | txt                                                            |
+-----------+------------+----------------------------------------------------------------+
| file1.txt | Lisa Smith | Lorem ipsum dolor sit amet, consectetur adipiscing elit.       |
|           |            |                                                                |
+-----------+------------+----------------------------------------------------------------+
| file2.txt | Doug       | Nulla sit amet leo quis risus viverra varius pretium sed nunc. |
|           |            | Nullam vitae tempor nisl.                                      |
+-----------+------------+----------------------------------------------------------------+
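The table above keeps the line breaks from the original file inside the txt cell. The target CSV in the question shows the text as one continuous run of sentences; if that is what is wanted, one possible approach is to collapse every run of whitespace into a single space before appending the row. This is an assumption about the desired output, not something the function above does, and flatten is a hypothetical helper name:

```python
def flatten(txt):
    """Collapse newlines and repeated spaces into single spaces (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing empties, so blank lines disappear too.
    return " ".join(txt.split())


txt = (
    "Nulla sit amet leo quis risus viverra varius pretium sed nunc.\n"
    "Nullam vitae tempor nisl.\n"
)
flat = flatten(txt)
print(flat)
```

It would be applied where the row is built, for example rows.append([fname, author, flatten(txt)]).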
Zach Young
  • Based on your suggestions, I posted my new code and the issue I'm encountering. – hy9fesh Mar 21 '23 at 15:51
  • Hi. I read your updated question and I don't see you calling extract_fields() in a loop, like I showed in the first "overview" block of code. – Zach Young Mar 21 '23 at 15:59
  • I just updated the code and posted the new error. – hy9fesh Mar 21 '23 at 20:32
  • Something isn't in your file that you're expecting there to be. Wrap the call to extract_fields() in a try/except block and print exceptions and filenames. You could also add a check after the "author loop" to see if author is empty and raise an exception, or do something that makes sense for your process. – Zach Young Mar 21 '23 at 21:01
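For reference, the StopIteration in the traceback is raised by next(f) when the file is already exhausted, either because no line starts with "author: " or because the author line is the last line in the file. A sketch of the two suggestions in the last comment, assuming a file without an author line should be reported and skipped; the ValueError, the skipped list, and build_rows are illustrative choices rather than part of the original answer:

```python
import os
import tempfile


def extract_fields(fname):
    """Return (author, txt) for fname; raise ValueError if no author line exists."""
    with open(fname) as f:
        author = None
        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break
        if author is None:
            raise ValueError(f"no 'author: ' line in {fname}")
        # The default argument stops next() from raising StopIteration
        # when the author line is the last line of the file.
        next(f, None)
        return author, f.read()


def build_rows(fnames):
    """Collect CSV rows, recording (filename, error) for files that fail."""
    rows, skipped = [], []
    for fname in fnames:
        try:
            author, txt = extract_fields(fname)
        except ValueError as e:
            skipped.append((fname, str(e)))
            continue
        rows.append([fname, author, txt])
    return rows, skipped


# Demo with one well-formed file and one that lacks the author line.
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good.txt")
    bad = os.path.join(d, "bad.txt")
    with open(good, "w") as f:
        f.write("foo\n\nauthor: Doug\n\nNullam vitae tempor nisl.\n")
    with open(bad, "w") as f:
        f.write("no author line here\n")
    rows, skipped = build_rows([good, bad])

print(rows)
print(skipped)
```

Printing skipped after the run shows which files need a closer look, without stopping the whole export at the first bad file.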