I'm reading a text file in Python 3.6 and want to extract lines at a fixed offset relative to a search match, then convert them into a pandas dataframe.

What works: Searching for a phrase in a text document and converting the line into a pandas df.

import pandas as pd
df = pd.DataFrame()
list1 = []
list2 = []

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            line = line.strip('\n')
            list1.append(repr(line))

# Convert list1 into a df column
df = pd.DataFrame({'Project_Name':list1})

What doesn't work: Returning a relative line based on the search result. In my case I need to store the "relative" line -6 to -2 (earlier in the text) as Pandas columns.

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            list2.append(repr(line)-6)  #<--- can't use math here

Returns: TypeError: unsupported operand type(s) for -: 'str' and 'int'

Also tried using a range with partial success:

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project' in line:
            all_lines = f.readlines()
            required_lines = [all_lines[i] for i in range(lineno-6,lineno-2)]
            print (required_lines)
            list2.append(required_lines)  #<-- does not work

Python will print the first 4 target lines but it does not seem to be able to save it as a list or loop through each finding of "Project" in the text doc. Is there a better way to save the results of the relative line above (or below) the search term? Thanks much.

Text data looks like:

0  Exhibit 3
1  Date: February 2018
2  Description
3  Description
4  Description
5  2015
6  2016
7  2017
8  2018
9  $100.50    <----  Add these as different dataframe columns
10 $120.33    <----
11 $135.88    <----
12 $140.22    <----
13 Project A
14
15 Exhibit 4
16 Date: February 2018
17 Description
18 Description
19 2015
20 2016
21 2017
22 2018
23 $899.25    <----
24 $901.00    <----
25 $923.43    <----
26 $1002.02   <----
27 Project B
Paul Roub
Arthur D. Howland

2 Answers

1

This might do the trick. It assumes that there are always four values before the 'Project' line.

>>> import pandas as pd
>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             # keep the four values plus the 'Project' line itself
...             a.append(prev_lines[-5:])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCDi'))
>>> df
         A        B        C         D          i
0  $100.50  $120.33  $135.88   $140.22  Project A
1  $899.25  $901.00  $923.43  $1002.02  Project B

Or without the project included:

>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             a.append(prev_lines[-5:-1])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCD'))
>>> df
         A        B        C         D
0  $100.50  $120.33  $135.88   $140.22
1  $899.25  $901.00  $923.43  $1002.02
Alex
  • Works a lot better than my latest, this is the first time I've seen "prev_lines = [ ]" list construction twice in the same block. Never thought of that. – Arthur D. Howland Aug 07 '18 at 13:15
  • I've updated the code to use a slightly better method of clearing the previous lines list. When I tested this on a file with 2000 records in it, it ran pretty quickly. – Alex Aug 07 '18 at 17:13
  • Is there a way to go forward 4 columns? I tried a.append(next_lines[:4]) but it skips the first instance. – Arthur D. Howland Aug 15 '18 at 18:23
  • I'm not sure I fully understand, do you mean access the rows from `project` onwards? This method works because the input is repetitive, but it could be achieved. Let me know what specific lines it should include. – Alex Aug 16 '18 at 08:55
  • Alex - yes, we got the previous 4 rows to work, but now I'm trying to get the NEXT 4 rows to go into a pandas df. So in the above example, when searching for 'Project' it would return: (blank), Exhibit 4, Date: February 2018, Description. – Arthur D. Howland Aug 20 '18 at 14:43
0

The reason your second solution is not working is that you are reading the file through an iterator (f in your case), which stops once it has gone through the whole file.

Your loop, for lineno, line in enumerate(f, 1):, is meant to go through the file line by line in a memory-efficient manner, reading only one line at a time. When you find a matching line you call all_lines = f.readlines(), which consumes the rest of the file. all_lines therefore holds only the lines after the match, so indices computed from lineno do not point at the lines you want. When the loop then asks for the next line, the exhausted iterator raises StopIteration, which the for statement treats as the normal end of the loop, so only the first match is ever processed.
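To see this behaviour in isolation, here is a small sketch. It uses io.StringIO with made-up content as a stand-in for the open file (an assumption for illustration, since it iterates line by line the same way):

```python
import io

# Stand-in for an open text file; the content is made up for this demo.
f = io.StringIO("a\nb\nProject A\nc\nd\nProject B\n")

matches = []
for lineno, line in enumerate(f, 1):
    if 'Project' in line:
        # readlines() returns only what has not been read yet,
        # i.e. the lines AFTER the current one, and exhausts f.
        rest = f.readlines()
        matches.append((lineno, rest))

print(len(matches))   # number of matches the loop actually reached
print(matches[0])     # line number of the match and the lines that followed it
```

Only the first 'Project' line is recorded, and the list returned by readlines() starts after it, which is why indexing it with lineno - 6 does not give the earlier lines.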

You can make your second solution work if you read the entire contents of the file first and then iterate through that list instead.
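A minimal sketch of that approach, assuming (as in the sample data) that the four values sit directly above each 'Project' line. The sample string stands in for the file contents and the column names A to D are placeholders:

```python
import pandas as pd

# Shortened sample standing in for the contents of myfile.txt.
sample = """Exhibit 3
2018
$100.50
$120.33
$135.88
$140.22
Project A

Exhibit 4
2018
$899.25
$901.00
$923.43
$1002.02
Project B
"""

# With a real file:
#     with open('myfile.txt') as f:
#         lines = f.read().splitlines()
lines = sample.splitlines()

rows = []
for i, line in enumerate(lines):
    # i >= 4 guards against a match too close to the top of the file
    if 'Project' in line and i >= 4:
        rows.append(lines[i - 4:i])   # the four lines just before the match

df = pd.DataFrame(rows, columns=list('ABCD'))
print(df)
```

Because the whole file is in a list, the offsets are plain index arithmetic, so a different window such as lines[i - 6:i - 2], or a forward one such as lines[i + 1:i + 5], only needs a different slice.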

If you want to be memory efficient, you can maintain a FIFO queue of the required number of lines.
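One way to sketch such a queue is collections.deque with maxlen, which drops the oldest line automatically as new ones arrive. This again assumes four values directly above each 'Project' line and uses io.StringIO with sample text in place of the real file:

```python
import io
from collections import deque

# Shortened sample standing in for the open file.
f = io.StringIO(
    "Exhibit 3\n2018\n$100.50\n$120.33\n$135.88\n$140.22\nProject A\n"
    "\n"
    "Exhibit 4\n2018\n$899.25\n$901.00\n$923.43\n$1002.02\nProject B\n"
)

N = 4                       # how many preceding lines to keep
window = deque(maxlen=N)    # holds at most the last N lines seen
rows = []

for line in f:
    line = line.rstrip('\n')
    if 'Project' in line:
        if len(window) == N:          # skip matches with too few lines before them
            rows.append(list(window))
        window.clear()
    else:
        window.append(line)

print(rows)
```

Only N lines are held in memory at any time, and rows can be passed to pd.DataFrame in the same way as in the other answer.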

Karthik V
  • Tried using relative_line = f.readlines(), Line6 = [relative_line[lineno - 6]]. Not working either. I'm not using f.readlines() correctly. – Arthur D. Howland Aug 06 '18 at 17:32