How to search for a combination of keywords in a text-file, extract lines above and below, and then export to Excel using pandas

Question

I am trying to extract 5 lines before and after a specific combination of keywords from several SEC 10-K filings and then export that data into Excel so that I can further process it manually. Unfortunately, I have to rely on the .txt format filings rather than the .html or .xbrl ones because the latter are not always available. I have already downloaded and partially cleaned the .txt files to remove unneeded tags.

In short, my goal is to tell Python to loop through the downloaded .txt files (e.g. all those in the same folder, or simply by providing a reference .txt list with all the file names), open each one, look for the phrase "cumulative effect" (ideally combined with other keywords, see code below), extract 5 lines before and after it, and then export the output to an Excel file with the filename in column A and the extracted paragraph in column B.

Using this code I managed to extract 5 lines above and below the keyword "cumulative effect" for one .txt file (which you can find here, for reference). However I am still struggling with automating/looping the whole process and exporting the extracted text to Excel using pandas.

import collections
import itertools
import sys
from pandas import DataFrame

filing='0000950123-94-002010_1.txt'

#with open(filing, 'r') as f:
with open(filing, 'r', encoding='utf-8', errors='replace') as f:
    before = collections.deque(maxlen=5)
    for line in f:
        if ('cumulative effect' in line or 'Cumulative effect' in line) and ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
            sys.stdout.writelines(before)
            sys.stdout.write(line)
            sys.stdout.writelines(itertools.islice(f, 5))
            break
        before.append(line)

findings = {'Filing': [filing],
        'Extracted_paragraph': [line]
        }

df = DataFrame(findings, columns= ['Filing', 'Extracted_paragraph'])

export_excel = df.to_excel (r'/Users/myname/PYTHON/output.xlsx', index = None, header=True)

print (df)

Using this code I obtain the paragraph I need on screen, but I only managed to export to Excel the single line in which the keyword is contained, not the entire text. This is the Python output and this is the exported text to Excel.

How do I go about creating the loop and properly exporting the entire paragraph of interest into excel? Thanks a lot in advance!!

Are you interested in the 5 lines before and after the first line ONLY in which the phrase appears? — Jack Fleeting, May 15 '19 at 16:03
@JackFleeting yes, basically I only care about the 11-line range with the line in which the keyword is contained in the middle (i.e. 5 lines, line w/ keyword, 5 lines). And Prune I apologize about that, I will try to make it more concise and clear — lnrd, May 15 '19 at 16:52

score 0 · Accepted Answer · answered May 15 '19 at 17:11

I believe your basic error was in

'Extracted_paragraph': [line]

which should have been

'Extracted_paragraph': [before]

So with some simplifying changes, the main section of your code should look like this:

with open(filing, 'r', encoding='utf-8', errors='replace') as f:
    # Rolling buffer that always holds the (up to) 5 most recent lines
    before = collections.deque(maxlen=5)

    for line in f:
        if ('cumulative effect' in line or 'Cumulative effect' in line) and ('accounting change' in line or 'adoption' in line or 'adopted' in line or 'charge' in line):
            # Stop at the first matching line; `before` now holds the lines preceding it
            break
        before.append(line)

# Join the buffered lines into one string so they fit in a single cell
before = ''.join(before)
findings = {'Filing': [filing],
            'Extracted_paragraph': [before]
            }

df = DataFrame(findings, columns=['Filing', 'Extracted_paragraph'])

And then continue from there to export to Excel, etc.
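Note that `before` on its own only holds the lines preceding the match. If you want the whole 11-line window (5 lines, the matching line, 5 lines), something along these lines might work. This is only a sketch: `is_hit` and `extract_window` are names I made up for illustration, and the keyword test simply mirrors the condition in the question.

```python
import collections
import itertools


def is_hit(line):
    """Same keyword test as in the question (hypothetical helper name)."""
    has_phrase = 'cumulative effect' in line or 'Cumulative effect' in line
    has_context = any(word in line for word in
                      ('accounting change', 'adoption', 'adopted', 'charge'))
    return has_phrase and has_context


def extract_window(lines, n=5):
    """Return (before, match, after) for the first matching line, or None.

    `lines` can be an open file or any iterable of strings.
    """
    lines = iter(lines)
    before = collections.deque(maxlen=n)  # keeps only the last n lines seen
    for line in lines:
        if is_hit(line):
            # Consume up to n further lines from the same iterator
            after = list(itertools.islice(lines, n))
            return ''.join(before), line, ''.join(after)
        before.append(line)
    return None  # no matching line in this input
```

You would call it as `result = extract_window(f)` inside the `with open(...)` block and, if `result` is not `None`, put `''.join(result)` into the `'Extracted_paragraph'` list instead of `before`.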

Jack Fleeting


Thanks for the suggestions! I extended your approach to also output the line with the keyword and the subsequent 5 lines as 3 different cells, which I am later going to merge manually! Do you know by any chance how I could now automate this process for all my .txt files and tell Python to append each result at the bottom of the Excel file? I tried creating a reference file with all the .txt file names, but I don't know how to create the loop in Python. Thanks in advance! – lnrd May 15 '19 at 19:34
@Inrd - It's hard to say in the abstract, but I would start by creating a list called `filings` which includes the name of each text file. Then surround the whole code above with `for file in filings: [run that code]`. It should work in principle, but may take a while to implement. – Jack Fleeting May 15 '19 at 19:51
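To make the `for file in filings` idea a bit more concrete, here is a rough sketch, assuming all the .txt filings sit in one folder. The function names (`is_hit`, `extract_paragraph`, `collect_findings`, `export_to_excel`) are hypothetical, and it collects one row per filing first and writes the Excel file once at the end rather than appending to it on every iteration. Writing .xlsx with pandas also needs an Excel engine such as openpyxl installed.

```python
import collections
import itertools
from pathlib import Path


def is_hit(line):
    """Same keyword test as in the question (hypothetical helper name)."""
    has_phrase = 'cumulative effect' in line or 'Cumulative effect' in line
    has_context = any(word in line for word in
                      ('accounting change', 'adoption', 'adopted', 'charge'))
    return has_phrase and has_context


def extract_paragraph(path, n=5):
    """Return n lines before, the first matching line, and n lines after, or None."""
    before = collections.deque(maxlen=n)
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            if is_hit(line):
                after = itertools.islice(f, n)  # next n lines of the same file
                return ''.join(before) + line + ''.join(after)
            before.append(line)
    return None


def collect_findings(folder):
    """Build one row per .txt file in `folder` that contains a match."""
    rows = []
    for path in sorted(Path(folder).glob('*.txt')):
        paragraph = extract_paragraph(path)
        if paragraph is not None:
            rows.append({'Filing': path.name, 'Extracted_paragraph': paragraph})
    return rows


def export_to_excel(rows, out_path):
    """Write the collected rows to a single Excel sheet."""
    import pandas as pd  # imported here so the collection step runs without pandas
    df = pd.DataFrame(rows, columns=['Filing', 'Extracted_paragraph'])
    df.to_excel(out_path, index=False)
```

Usage would be roughly `export_to_excel(collect_findings('path/to/filings'), 'output.xlsx')`, where both paths are placeholders. Filings without a match are skipped here; you may prefer to add a row with an empty paragraph so you can see which files had no hit.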
