I have a folder full of subfolders with text (.txt) files that look like this:
some random information here
ignore it
author: Lisa Smith
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl.
Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue.
Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam.
I want to create a CSV file that looks like this:
filename | author | text |
---|---|---|
/fullfilepathhere/ | Lisa Smith | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl. Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue. Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam. |
This is the code I currently have, which I cobbled together from a previous question and several other posts:
from glob import glob
import os
import re
import csv
import nltk
path = '**/*.txt'
def extract_fields(fname):
    with open(fname) as f:
        author, txt = "", ""
        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break
        next(f) # discard the following blank line
        txt = f.read()
    return author, txt
rows = []
for fname in glob(path):
    author, txt = extract_fields(fname)
    rows.append([fname, author, txt])
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "author", "txt"])
    writer.writerows(rows)
I am getting the following error:
Traceback (most recent call last):
  File "print_text.py", line 28, in <module>
    author, txt = extract_fields(fname)
  File "print_text.py", line 19, in extract_fields
    next(f) # discard the following blank line
StopIteration
Any guidance would be appreciated!
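Going only by the traceback, one likely cause is that `next(f)` raises `StopIteration` whenever there are no lines left to read. That happens if a file has no `author: ` line (the `for` loop then consumes the whole file) or if the author line is the last line. Below is a minimal sketch of a more tolerant version, not a definitive fix. It assumes the files are UTF-8 and that the text should be joined into one line, as in the desired table; `write_csv` is a hypothetical helper name that is not in the original code.

```python
import csv
from glob import glob


def extract_fields(fname):
    """Return (author, text); author is "" if no author line is found."""
    author, body = "", []
    # encoding="utf-8" is an assumption about the input files
    with open(fname, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[len("author: "):]
                break
        # If the loop above ran out of lines, f is already exhausted and
        # this second loop does nothing, so StopIteration cannot occur.
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of calling next(f)
                body.append(line)
    return author, " ".join(body)


def write_csv(pattern, out_path):
    """Write one CSV row per matching file (hypothetical helper)."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "author", "text"])
        # recursive=True is required for "**" to descend into nested folders
        for fname in sorted(glob(pattern, recursive=True)):
            writer.writerow([fname, *extract_fields(fname)])
```

It could be called as `write_csv("**/*.txt", "output.csv")`. Files without an author line would then produce a row with empty author and text fields rather than stopping the whole run, which also makes them easy to spot in the output.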