Suppose I have this data in a text file. My script should extract everything between index1 and index2 and write those lines to the output file, but for some reason it stops a few lines before index2.

Dummy data:

index1 0000

random data

index1 0000

random data

index1 0000

index2 0000

Here is my code. It starts writing to my output file as soon as it sees index1; when it sees index2, it should write that last match and exit. But it never seems to exit: it appears to hang, and the output stops a few lines before index2, always on the same line. If the data weren't sensitive I would paste the actual data.

import re
myvar = False
myfile = open('extract','w')

with open('input.txt') as f:
    for line in f:
        if re.search(r'index1', line):
            myvar = True
            myfile.write(line)

        elif re.search(r'index2', line):
            myvar = False
            break

        elif myvar == True:
            myfile.write(line)
            continue

myfile.close
f.close

The thing is, it works with my dummy data but not with the real data, where it stops on the line below. That line starts with a form feed, which I thought might be messing it up, but there are multiple form feeds before this one, and those lines are written to the output file.

FF (redacted) whitespace whitespace (redacted) datetime at datetime page 50

Thank you.

arealhobo
  • What is the expected output? What is the actual output? If there is an error, what is it? – YSelf Feb 22 '18 at 00:30
  • Well it should have everything between index1 and index2, but it stops a few lines before index2. So in the output file it looks like it is still being written to, has index1 random data index1 random data and stops there, doesn't get to index2. There is no error, to me it seems it is still running, from what I can tell, the output file stays at 0kb, but there is data in it. I am running it with idle and it just shows this in the shell -------- RESTART: C:\myprogram\myscript.py >>> – arealhobo Feb 22 '18 at 00:36
  • Only thing I can think of: is it not reading the line as a string? – arealhobo Feb 22 '18 at 00:38
  • For this part in the script, the actual string goes like this; elif re.search(r'String Part 0806', line); maybe because it ends in a number? – arealhobo Feb 22 '18 at 00:39
  • You don't need the continue statement. A for loop will continue until the end of the file. Either use an IDE to step through the code or put print statements after the for and each if-else so you can see the execution path. – LAS Feb 22 '18 at 00:43
  • Have you examined the raw string? My first thought was a mismatch between the visual (word wrap) and the actual line end (\n or \r\n). If you are only looking at the input in a text editor, it may look like it is skipping some lines. Your code runs on the sample text you've posted above, which suggests you need to look at all the characters, including non-visual ones. – Alan Feb 22 '18 at 01:06
  • @LAS - I did remove the continue, and it still stops in the same place. Added some print statements and it does find the index2 then prints and breaks, but in the output file it still stops before the expected match. – arealhobo Feb 22 '18 at 01:11
  • @Alan - What's the best way to approach non-visual characters, Notepad++? Would it be the regex causing the issue, so I would need to modify it? – arealhobo Feb 22 '18 at 01:13
  • @MPineda Notepad++ should do it - you need to use View->Show Symbols->Show all characters. You'll see symbols like "CR" and "LF", which are the line endings. If you see any other symbols, this might be the cause of the crash. – Alan Feb 22 '18 at 01:33
  • Also, a few points about the code. You can wrap the output file in a context manager as well as the input i.e. `with open('extract.txt', 'w') as myfile: with open('input.txt', 'r') as f:` – Alan Feb 22 '18 at 01:36
  • Using the debugger I get a KeyError on "re.search(r'index1', line)" not sure what that means, the 'r' is not in the dictionary? – arealhobo Feb 22 '18 at 01:36
  • Secondly, print(re.findall(r'index1.*?index2', f.read(), re.DOTALL)) will print a list with all the matches - there's no need to manually loop through the file. – Alan Feb 22 '18 at 01:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165602/discussion-between-alan-and-mpineda). – Alan Feb 22 '18 at 01:44
  • @MPineda If you have an IDE like wing, you can step through the code one line at a time and see the contents of your variables. – LAS Feb 22 '18 at 02:09
  • @MPineda I copied your data and code onto my machine and it ran fine. I believe your issue is with the data file. First thought is the encoding. The input data was copied to my machine as Unicode (UTF-8). Files can also become corrupt so I'd look at it in HEX and if that doesn't work, I'd recreate it. – LAS Feb 22 '18 at 02:43
  • Hmm, it does show as ANSI – arealhobo Feb 22 '18 at 02:45
  • @LAS - still stops on same line as before after changing to UTF-8, not sure what you mean by HEX – arealhobo Feb 22 '18 at 02:47
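
The suggestions above about non-visual characters and looking at the file "in HEX" can also be followed from Python itself. The snippet below is only a minimal sketch: `show_hidden` is a hypothetical helper name, and the demo file contents are made up for illustration, not taken from the real data. It opens the file in binary mode and reports each line's `repr()` and hex dump, so form feeds (`\x0c`) and `\r\n` line endings become visible.

```python
import os
import tempfile


def show_hidden(path, start=0, count=5):
    """Return repr() and hex of `count` raw lines, starting at line index `start`.

    Hypothetical helper for illustration; not part of the original script.
    """
    shown = []
    # Binary mode: no decoding and no newline translation, so every byte is visible.
    with open(path, "rb") as f:
        for number, raw in enumerate(f):
            if number >= start + count:
                break
            if number >= start:
                shown.append(f"{number}: {raw!r} hex={raw.hex()}")
    return shown


# Demo on a throwaway file with made-up content: CRLF line endings and a form feed.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"index1 0000\r\n\x0cpage 50\r\nindex2 0000\r\n")
    demo_path = tmp.name

report = show_hidden(demo_path)
for entry in report:
    print(entry)

os.remove(demo_path)
```

Pointing `show_hidden` at the real input, with `start` set a few lines before the place where the output stops, would show whether that line contains anything unusual.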

1 Answer

Following our discussion ...

You can simplify your code, eliminate the loop and remove the cause of your error by switching from re.search to re.findall. This will produce a list with all the matches.

If you want to eliminate duplicates, you can convert the list to a set, which is an unordered collection without duplicates.

You should also wrap the output file in a context manager (with open) in the same way you have the input file. This has a better chance of closing the file properly.
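
As a sketch of that point applied to the loop from the question: in the original script, `myfile.close` and `f.close` are written without parentheses, so they only reference the method and never call it, which means buffered output may never be flushed to disk. That may be why the output file appears truncated, though the real data would be needed to confirm it. The version below is an illustration, not the asker's exact code: `extract_between` is a hypothetical name, it uses plain substring tests instead of `re.search`, and it writes the index2 line before stopping, as the question describes.

```python
import os
import tempfile


def extract_between(src_path, dst_path, start="index1", end="index2"):
    """Copy lines from the first `start` line through the first `end` line.

    Hypothetical helper for illustration; marker names follow the question.
    """
    copying = False
    # Both files are closed (and the output flushed) when the with block exits,
    # even if the loop ends early via break.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if end in line:
                if copying:
                    dst.write(line)  # include the closing marker line
                break
            if start in line:
                copying = True
            if copying:
                dst.write(line)


# Demo with made-up data in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.txt")
dst = os.path.join(workdir, "extract.txt")
with open(src, "w") as f:
    f.write("header\nindex1 0000\nrandom data\nindex1 0000\nindex2 0000\ntrailer\n")

extract_between(src, dst)

with open(dst) as f:
    result = f.read()
print(result)
```

If the original loop is kept as it is, calling `myfile.close()` with parentheses has the same flushing effect as the context manager.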

If you want to take actions on the set, you can loop through it as if it were a list, or if you need to get just one element (e.g. for testing on the next part of your code), you can convert to a list - list(j)[0]

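import re

# Open the output and input files with context managers so both are closed properly.
with open("extract.txt", 'w') as myfile:
    with open("input2.txt", 'r') as f:
        # Non-greedy match from each index1 to the next index2;
        # re.DOTALL lets '.' match newlines as well.
        output = re.findall(r'index1.*?index2', f.read(), re.DOTALL)
    j = set(output)  # drop duplicate matches (order is not preserved)
    for x in j:
        myfile.write(x + '\n')

With a single element, it would change to:

with open("extract.txt", 'w') as myfile:
    with open("input2.txt", 'r') as f:
        output = re.findall(r'index1.*?index2', f.read(), re.DOTALL)
    # Assumes at least one match; an empty result raises IndexError.
    myfile.write(list(set(output))[0] + '\n')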
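
One consequence of going through a set is that the matches can come out in a different order than they appear in the file. If order matters, a possible alternative (not part of the approach above) is to de-duplicate with `dict.fromkeys`, which keeps first-seen order. The sample text below is made up for illustration.

```python
import re

# Made-up sample with one repeated block.
text = (
    "index1 A\nx\nindex2 0\n"
    "noise\n"
    "index1 B\nindex2 1\n"
    "index1 A\nx\nindex2 0\n"
)

# re.findall returns a list of strings when the pattern has no capture groups.
matches = re.findall(r"index1.*?index2", text, re.DOTALL)

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the order of first appearance.
unique_in_order = list(dict.fromkeys(matches))

for block in unique_in_order:
    print(repr(block))
```

Note that each match ends right after the literal `index2`, so the rest of that line is not captured; extending the pattern with `[^\n]*` would include it.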
Alan