I'm writing a small script to run through large folders of copyright notice emails and find the relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes the IP and timestamp are on different lines, sometimes on the same line, sometimes in different places, timestamps come in 4 different formats, etc.).

I ran into one weird problem where a few of the files I'm parsing through spew out weird characters in the middle of a line, ruining my parsing of readline() returns. When reading in a text editor, the line in question looks normal, but readline() reads an '=' and two '\n' characters right smack in the middle of an IP.

e.g.

Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"

Broken readline() return:
"IP Address: xxx.xxx.xxx="

The next two lines after that being:
""
".xxx"

Any idea how I could get around this? I don't really have control over what problem could be causing this, I just kind of need to deal with it without getting too crazy.

Relevant function, for reference (I know it's a mess):

import codecs
import re

def getIP(em):
    # The with block closes the file on every return path.
    with codecs.open(em, encoding='latin1') as ce:
        iplabel = ""
        while not ("Torrent Hash Value: " in iplabel):
            iplabel = ce.readline()

        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]

        # Not on this line, so try the next one.
        ipraw = ce.readline()
        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw
snakecharmerb
noringo
  • Are you sure that there is an `=` character before two `\n` only? What about some other IP has some other character like `=` and may be more than one? In case you only have `=\n\n` you can write your regex for IP to account this by having `(?:=\n*)?` just before your last IP part `.xxx` – Pushpesh Kumar Rajwanshi Mar 21 '19 at 19:41
  • The issue is I'm only applying regex after reading the line into a string, and the new line characters break the string apart. My first instinct would be to read 3 lines, concatenate them, then regex, but that would be a pretty big extra load on the script if it was run each time, and it would be pretty spaghetti-code if I just stuck it in yet another else: at the end, since I'd need to save the line position and go back to it if the "normal" searches don't work. – noringo Mar 21 '19 at 19:49
  • If your data is split across multiple lines, I suggest you to at least work on a string by combining two lines at least and in each step read one more line and discard the first line and join the second line with next new line and iterate this way, otherwise capturing/extracting right patterns will be hard for you. – Pushpesh Kumar Rajwanshi Mar 21 '19 at 19:54
  • Ended up just saving the earlier read lines, combining them, then using re.sub to remove (=\r*\n), and it works (turns out there was also a \r character in between the = and \n, which was confusing for a minute). Thanks for your help. – noringo Mar 21 '19 at 20:36
  • If you've solved the problem, please add and accept it as an answer, rather than putting the solution in the question. – glibdud Mar 21 '19 at 20:49
  • Ah, sorry, first time actually posting a question here. Can do. – noringo Mar 21 '19 at 20:50

2 Answers


It seems likely that at least some of the emails that you are processing have been encoded as quoted-printable.

This encoding is used to make 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. Longer lines are split by inserting a soft line break, which consists of "=" followed by the end-of-line marker.

Python provides the quopri module to handle encoding and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.

As an example, let's use the first paragraph of your question.

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
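Applied to the symptom in the question, the same decoding step rejoins an IP address that was split by a soft line break. This is a minimal sketch: the address 192.0.2.17 and the exact line layout are made-up sample data, not taken from your files.

```python
import quopri
import re

# Hypothetical sample: an IP split by a quoted-printable soft line break ("=\r\n").
raw = b"IP Address: 192.0.2=\r\n.17\r\n"

# Decoding removes the "=" and the line ending that follows it.
decoded = quopri.decodestring(raw).decode("latin-1")

# The same pattern used in the question now matches the whole address.
ips = re.findall(r"[0-9]+(?:\.[0-9]+){3}", decoded)
print(ips)
```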

To decode correctly, the entire message body needs to be processed, which conflicts with your approach using readline. One way around this is to load the decoded string into a buffer:

import io
import quopri

def getIP(em):
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    ce = io.StringIO(decoded)
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
        ...
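For reference, here is one way the buffer approach could be filled out end to end. It is only a sketch under assumptions: the name get_ip_from_bytes, the lookahead parameter and the sample message are hypothetical, and it returns None instead of looping forever when the label is never found.

```python
import io
import itertools
import quopri
import re

IP_PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

def get_ip_from_bytes(raw, label="Torrent Hash Value: ", lookahead=3):
    """Return the first IP found within `lookahead` lines after `label`, else None."""
    # Decode the whole body first so soft line breaks are already gone.
    text = quopri.decodestring(raw).decode("latin-1")
    lines = iter(io.StringIO(text))

    # Advance to the label line; give up if the file ends first.
    for line in lines:
        if label in line:
            break
    else:
        return None

    # Check the next few lines for an IP address.
    for line in itertools.islice(lines, lookahead):
        match = IP_PATTERN.search(line)
        if match:
            return match.group(0)
    return None

# Hypothetical sample data mimicking the layout described in the question.
sample = (
    b"Torrent Hash Value: abcdef\r\n"
    b"File Size: 10 MB\r\n"
    b"IP Address: 192.0.2=\r\n"
    b".17\r\n"
)
print(get_ip_from_bytes(sample))
```

To use it on a file, read the file in binary mode ('rb') and pass the bytes in.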

If your files contain complete emails - including headers - then using the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
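If some of the messages are multipart, get_content() on the top-level message will fail, so it may be safer to ask for the plain-text part first with get_body(). The sketch below assumes a complete message held in bytes; extract_ip and the sample message are hypothetical names and data used only for illustration.

```python
import email
import re
from email import policy

IP_PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

def extract_ip(raw_bytes):
    """Parse a full email and return the first IP in its plain-text body, else None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() returns the message itself for a simple text/plain email,
    # or the text/plain part of a multipart one; None if there is no such part.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return None
    # get_content() undoes the quoted-printable transfer encoding and the charset.
    text = part.get_content()
    match = IP_PATTERN.search(text)
    return match.group(0) if match else None

# Hypothetical quoted-printable message with the IP split by a soft line break.
sample = (
    b"From: notices@example.com\r\n"
    b"Subject: Notice\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Torrent Hash Value: abcdef\r\n"
    b"IP Address: 192.0.2=\r\n"
    b".17\r\n"
)
print(extract_ip(sample))
```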
snakecharmerb

Solved. If anyone else has a similar problem: save each line as a string, merge the lines together, and use re.sub() to strip out the "=" and line-ending characters, keeping in mind that there may be \r as well as \n. My solution is a bit spaghetti, but it avoids running the extra regex on every file:

import codecs
import re

def getIP(em):
    # The with block closes the file on every return path.
    with codecs.open(em, encoding='latin1') as ce:
        iplabel = ""
        while not ("Torrent Hash Value: " in iplabel):
            iplabel = ce.readline()

        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]

        ipraw2 = ce.readline()  # made this a new var
        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ipraw2)
        if ip:
            return ip[0]

        # Added this section: merge both lines and strip the "=" line break.
        ipraw = ipraw + ipraw2
        ipraw = re.sub(r'(=\r*\n)', '', ipraw)
        ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw
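The merge-and-strip step can also be pulled out into a small helper so it is easy to test on its own. This is a sketch rather than a drop-in replacement: find_ip and the sample lines are hypothetical, and stripping "=" plus line ending only removes soft line breaks; it does not decode other quoted-printable escapes such as =3D.

```python
import re

SOFT_BREAK = re.compile(r"=\r*\n")          # "=" followed by optional \r and a \n
IP_PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

def find_ip(lines):
    """Join the given lines, drop soft line breaks, and return the first IP or None."""
    joined = SOFT_BREAK.sub("", "".join(lines))
    match = IP_PATTERN.search(joined)
    return match.group(0) if match else None

# Hypothetical lines as readline() might return them for a split address.
print(find_ip(["IP Address: 192.0.2=\r\n", ".17\r\n"]))
```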
noringo