2

I need to extract a portion of text from a txt file.
The file looks like this:

STARTINGWORKIN DD / MM / YYYY HH: MM: SS
... text lines ...
... more text lines ...
STARTINGWORKING DD / MM / YYYY HH: MM: SS
... text lines I want ...
... more text lines that I want ...

  • The file starts with STARTINGWORK and ends in text lines.
    I need to extract the final text portion after the last STARTINGWORK, without the STARTINGWORK str

I tried using three for loops (one to find the start, another to read the lines in between, and the last to find the end):

     import os

     file = "records.txt"
     if file.endswith(".txt"):
       if os.path.exists(file):
         with open(file) as f:
           lines = [line.rstrip('\n') for line in f]
         for line in lines:
             pass  # extract the portion here
Y4RD13

5 Answers

2

Try this:

import os

file = "records.txt"
extracted_text = ""
if file.endswith(".txt"):
    if os.path.exists(file):
        with open(file) as f:
            # Split the whole text on every occurrence of the marker
            lines = f.read().split("STARTINGWORKING")
        extracted_text = lines[-1]  # Text after the last marker
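
If only the lines after the marker line are wanted (without the date and time that follow the marker on the same line), one possible sketch uses str.rpartition, which splits on the last occurrence only. The helper name text_after_last_marker and the sample text are made up for illustration, and the marker is assumed to be spelled in the file exactly as in the code.

```python
def text_after_last_marker(text, marker="STARTINGWORKING"):
    """Return the lines that follow the last line containing `marker`."""
    # rpartition splits on the LAST occurrence of the marker
    _, found, tail = text.rpartition(marker)
    if not found:
        return ""  # marker not present at all
    # tail begins with the rest of the marker line (date and time); drop it
    _, _, body = tail.partition("\n")
    return body


# Hypothetical sample data standing in for records.txt
sample = (
    "STARTINGWORKING 01 / 01 / 2019 10: 00: 00\n"
    "old line\n"
    "STARTINGWORKING 02 / 01 / 2019 11: 00: 00\n"
    "wanted line 1\n"
    "wanted line 2\n"
)
print(text_after_last_marker(sample))
```

For a real file, the text would come from reading the file in full, so this approach still loads everything into memory at once.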
Akaisteph7
  • This will not work with more than 2 STARTINGWORKING markers. The text currently has 2 of them, but the number can increase arbitrarily. So the objective is to reach the lines after the last STARTINGWORK – Y4RD13 Jul 05 '19 at 17:30
  • @BenjamínSerra Why won't this work now? Did you actually try it? – Akaisteph7 Jul 05 '19 at 17:35
  • @BenjamínSerra the line: `open(file).read().split('STARTINGWORK')` will make a list of all the text sections in-between every occurrence of "STARTINGWORK". Then you can simply take the last element in that list, and the rest of Akaisteph7's code is just to properly remove the date and time. – Aaron Jul 05 '19 at 17:43
  • @BenjamínSerra Please mark the correct answer if your question was resolved. – Akaisteph7 Jul 05 '19 at 18:08
  • @Akaisteph7 Yes I'm trying with this code. But is still not working, I'm getting the whole text or the numbers 22 51 if I use extracted_text. I don't know what I'm doing wrong :/ – Y4RD13 Jul 05 '19 at 18:14
  • @BenjamínSerra Are you sure you are doing exactly what is here? Did you remove `lines = [line.rstrip('\n') for line in open(file)]` and `for line in lines:` from your code? – Akaisteph7 Jul 05 '19 at 18:24
  • Oh, I can't put the code here, but I did exactly as you said, removing `[line.rstrip('\n') for line in open(file)]` and `for line in lines:` from the code. It is giving me back the numbers 22 51, from the last line 08/06/2019 15:58:40 – Y4RD13 Jul 05 '19 at 19:03
2

You can use the file_read_backwards module to read the file from the end to the beginning. This saves time when the file is large, because reading stops as soon as the last marker is found:

from file_read_backwards import FileReadBackwards

with FileReadBackwards("records.txt") as file:
    portion = []
    for line in file:
        if not line.startswith('STARTINGWORKING'):
            portion.append(line)
        else:
            break  # Reached the last marker, stop reading
portion.reverse()  # Lines were collected bottom-up, restore file order

portion now contains the desired lines.

Masoud
  • When I try this it gives me an OSError. I'm using try and except to validate the path of the file. – Y4RD13 Jul 05 '19 at 18:23
  • I changed the file name to `records.txt`. – Masoud Jul 05 '19 at 18:32
  • If I print portion, I'm getting the full text :/ – Y4RD13 Jul 05 '19 at 18:52
  • It works fine for me. maybe there is a character before `STARTING WORKING`. You can make small dummy text file and try the code on it and debug it. – Masoud Jul 05 '19 at 19:44
  • It works; 'STARTINGWORKING' was misspelled in the txt file. The only thing left is that I need to strip \n from the portion, but that's no big deal! Thank you – Y4RD13 Jul 05 '19 at 20:22
1

I would take the regex path to tackle this:

>>> import re
>>> input_data = open('path/file').read()
>>> result = re.search(r'.*STARTINGWORKING\s*(.*)$', input_data, re.DOTALL)
>>> print(result.group(1))
#'DD / MM / YYYY HH: MM: SS\n... text lines I want ...\n... more text lines that I want ...'
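
The pattern above keeps the date and time in the captured group. A possible variant, sketched here under the assumption that the marker line always ends with a newline, also consumes the rest of the marker line so that only the following lines are captured. The function name after_last_marker and the sample string are hypothetical.

```python
import re


def after_last_marker(text, marker="STARTINGWORKING"):
    # The greedy .* with re.DOTALL runs up to the last occurrence of the marker;
    # [^\n]*\n then consumes the rest of that line (the date and time).
    pattern = r'.*' + re.escape(marker) + r'[^\n]*\n(.*)$'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical sample data
sample = (
    "STARTINGWORKING 01 / 01 / 2019 10: 00: 00\n"
    "old line\n"
    "STARTINGWORKING 02 / 01 / 2019 11: 00: 00\n"
    "wanted line 1\n"
    "wanted line 2\n"
)
print(after_last_marker(sample))
```

Returning None when nothing matches avoids the AttributeError that calling group() on a failed search would raise.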
drec4s
1

The get_final_lines generator tries to avoid allocating more storage than necessary while reading a potentially large file: it buffers only the lines seen since the most recent STARTINGWORK line.

def get_final_lines(fin):
    buf = []
    for line in fin:
        if line.startswith('STARTINGWORK'):
            buf = []
        else:
            buf.append(line)

    yield from buf


if __name__ == '__main__':
    with open('some_file.txt') as fin:
        for line in get_final_lines(fin):
            print(line.rstrip())
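
One way to try the generator without a file on disk is to feed it an in-memory stream. This is only a sketch: io.StringIO stands in for the opened file, and the sample text is invented for the check.

```python
import io


def get_final_lines(fin):
    buf = []
    for line in fin:
        if line.startswith('STARTINGWORK'):
            buf = []  # A new marker: discard everything collected so far
        else:
            buf.append(line)

    yield from buf


# Hypothetical sample data in place of some_file.txt
sample = io.StringIO(
    "STARTINGWORKING 01 / 01 / 2019 10: 00: 00\n"
    "old line\n"
    "STARTINGWORKING 02 / 01 / 2019 11: 00: 00\n"
    "wanted line 1\n"
    "wanted line 2\n"
)
result = [line.rstrip() for line in get_final_lines(sample)]
print(result)
```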
J_H
0

You can have a variable that saves all the lines you have read since the last STARTINGWORK.
When you finish processing the file you have just what you need.

You certainly do not need to read all the lines into a list first. You can iterate directly over the open file, which returns one line at a time, e.g.:

result = []
with open(file) as f:
    for line in f:
        if line.startswith("STARTINGWORK"):
            result = []       # Delete what would have accumulated
        result.append(line)  # Add the last line read
print("".join(result))

result then holds everything from the last STARTINGWORK line onward, including that line itself. Use result[1:] if you want to drop the initial STARTINGWORK line.

Then, in the code:

#list
result = []

#function
def appendlines(line, result, word):
  if line.startswith(word):
    del result[:]  # A new marker: clear what has accumulated so far
  result.append(line)
  return line, result

with open(file, "r") as lines:
  for line in lines:
    appendlines(line, result, "STARTINGWORK")
# Skip the marker line itself and strip the trailing newlines
new_result = [line.rstrip("\n") for line in result[1:]]
Y4RD13