I am trying to extract 5 lines before and after a specific combination of keywords from several SEC 10-K filings and then export that data into Excel so that I can then further process it manually. Unfortunately I have to rely on the .txt format filings rather than the .html or .xblr ones because the latter are not always available. I already downloaded and partially cleaned the .txt files to remove unneeded tags.
In short, my goal is to tell python to loop through the downloaded .txt files (e.g. all those in the same folder or simply by providing a reference .txt list with all the file names), open each one, look for the the word "cumulative effect" (ideally combined with other keywords, see code below), extract 5 lines before and after it, and then export the output to an excel with the filename in column A and the extracted paragraph in column B.
Using this code I managed to extract 5 lines above and below the keyword "cumulative effect" for one .txt file (which you can find here, for reference). However I am still struggling with automating/looping the whole process and exporting the extracted text to Excel using pandas.
import collections
import itertools
import sys
from pandas import DataFrame
filing='0000950123-94-002010_1.txt'
#with open(filing, 'r') as f:
with open(filing, 'r', encoding='utf-8', errors='replace') as f:
before = collections.deque(maxlen=5)
for line in f:
if ('cumulative effect' in line or 'Cumulative effect' in line) and ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
sys.stdout.writelines(before)
sys.stdout.write(line)
sys.stdout.writelines(itertools.islice(f, 5))
break
before.append(line)
findings = {'Filing': [filing],
'Extracted_paragraph': [line]
}
df = DataFrame(findings, columns= ['Filing', 'Extracted_paragraph'])
export_excel = df.to_excel (r'/Users/myname/PYTHON/output.xlsx', index = None, header=True)
print (df)
Using this line of code I obtain the paragraph I need, but I only managed to export the single line in which the keyword is contained to excel and not the entire text. This is the python output and this is the exported text to Excel.
How do I go about creating the loop and properly exporting the entire paragraph of interest into excel? Thanks a lot in advance!!