I have a file of WhatsApp messages that I want to save in CSV format. The file looks like this:

[04/02/2018, 20:56:55] Name1: ‎Messages to this chat and calls are now secured with end-to-end encryption.
[04/02/2018, 20:56:55] Name1: Content1.
More content.
[04/02/2018, 23:24:44] Name2: Content2.

I want to parse messages into date, sender, text columns. My code:

with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        date = datetime.strptime(
            re.search('(?<=\[)[^]]+(?=\])', line).group(), 
            '%d/%m/%Y, %H:%M:%S')
        sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()
        text = line.rsplit(']', 1)[-1].rsplit(': ', 1)[-1]

        new_line = str(date) + ',' + sender + ',' + text
        outfile.write(new_line)

I have problems handling multi-line messages. (I sometimes started a new line within a message; in that case the line contains only text that belongs to the previous message.) I'm also open to a more Pythonic way of parsing the datetime, sender, and text. My code parses date, sender, and text correctly for complete lines, but it raises an error because not every line matches all the patterns:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-efbcb430243d> in <module>()
      3     for line in infile:
      4         date = datetime.strptime(
----> 5             re.search('(?<=\[)[^]]+(?=\])', line).group(),
      6             '%d/%m/%Y, %H:%M:%S')
      7         sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()

AttributeError: 'NoneType' object has no attribute 'group'

Idea: maybe use try/except and then somehow append the text-only line to the previous one? (Doesn't sound Pythonic.)

mihagazvoda
  • The regex for sender: `(?<=\] )[^]]+(?=\:)` - I think you should change it to `(?<=\] )[^]]+?(?=\:)` – Wololo Jul 25 '18 at 20:01
  • The regex for `date` looks fine (demo: https://regex101.com/r/TXqxPK/1). Make sure that the line is not empty or something – Wololo Jul 25 '18 at 20:08
  • Both regexes work fine (I printed the output). The problem is that I sometimes start a new line inside a message; in that case the line contains only text that is supposed to be part of the previous line. – mihagazvoda Jul 25 '18 at 20:10
  • I think you should read a line and check whether it starts with `[`. If it does, the line is a new message. If it doesn't, the line is part of the previous message's content, so append it to the previous message's content – Wololo Jul 25 '18 at 20:21
  • That's a good idea. Any idea how to append it to the previous line? – mihagazvoda Jul 25 '18 at 20:34
  • Create a temporary variable, say `x`, and set it to an empty string. Then open the input stream. Read a line and check whether the first non-whitespace character is `[`. If it is, first flush `x` onto the output stream, then set `x` to `str(date) + ',' + sender + ',' + text`. If the first non-whitespace character is not `[`, simply set `x` to `x + line` (do not output anything) – Wololo Jul 25 '18 at 20:48
  • I am commenting on mobile, so the formatting might be off ... sorry – Wololo Jul 25 '18 at 20:50
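
The buffering idea from the comments can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `merge_continuation_lines` is made up, and it assumes every new message starts with `[` while continuation lines never do.

```python
def merge_continuation_lines(lines):
    """Yield one string per message, with continuation lines joined on."""
    buffer = None  # the message currently being assembled
    for line in lines:
        line = line.rstrip("\n")
        if line.lstrip().startswith("["):
            # A new message begins: flush the previous one first.
            if buffer is not None:
                yield buffer
            buffer = line
        elif buffer is not None:
            # Continuation of the current message.
            buffer += " " + line
        # Text before the first message has nothing to attach to and is dropped.
    if buffer is not None:
        yield buffer  # flush the last message
```

Each yielded string can then be parsed with the regexes from the question, for example `for message in merge_continuation_lines(infile): ...`. A continuation line that itself begins with `[` would be mistaken for a new message, so a stricter test on the date pattern may be safer.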

1 Answer


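Here is something that should work to append the extra text to the previous line.

The code checks whether both regexes match. If they do not, the line is treated as a continuation and is written without a preceding newline, so it lands at the end of the previous line in the output file.

import re
from datetime import datetime

start = True

with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        # Drop the line ending so continuation lines can be appended later
        line = line.rstrip('\n')
        time = re.search(r'(?<=\[)[^]]+(?=\])', line)
        # Lazy match so a colon inside the message text is not taken as part of the sender
        sender = re.search(r'(?<=\] )[^]]+?(?=:)', line)
        if sender and time:
            date = datetime.strptime(
                time.group(),
                '%d/%m/%Y, %H:%M:%S')
            sender = sender.group()
            # str.rsplit does not accept a regex, so split once with re.split
            text = re.split(r'\].+?: ', line, maxsplit=1)[-1]
            new_line = str(date) + ',' + sender + ',' + text
            # Start a new output line for every message except the first
            if not start: new_line = '\n' + new_line
            outfile.write(new_line)
        else:
            # Continuation line: append to the message written just before it
            outfile.write(' ' + line)
        start = False

Each input line already ends with a newline, so the code strips it and writes the newline itself, just before each new message. That is what lets a continuation line end up on the same output line as the message it belongs to.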
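
Since the question also asks for a more Pythonic way to parse the fields, here is a minimal sketch that uses a single regex and the standard `csv` module. The names `MESSAGE`, `parse_chat`, and `write_csv` are made up for illustration, and the sketch assumes every message header has the exact `[dd/mm/yyyy, HH:MM:SS] sender: text` shape shown in the question.

```python
import csv
import re
from datetime import datetime

# One pattern for the whole header: "[date, time] sender: text"
MESSAGE = re.compile(r"\[(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] ([^:]+): (.*)")

def parse_chat(lines):
    """Return a list of [date, sender, text] rows."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        match = MESSAGE.match(line)
        if match:
            stamp, sender, text = match.groups()
            date = datetime.strptime(stamp, "%d/%m/%Y, %H:%M:%S")
            rows.append([date, sender, text])
        elif rows:
            # No header on this line: treat it as part of the previous message.
            rows[-1][2] += "\n" + line
    return rows

def write_csv(rows, path):
    """Write the rows with a header line; csv.writer quotes fields as needed."""
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["date", "sender", "text"])
        writer.writerows(rows)
```

Usage would be along the lines of `with open('chat.txt', encoding='utf-8') as infile: write_csv(parse_chat(infile), 'Output.csv')`. Letting `csv.writer` do the quoting means that commas and line breaks inside a message do not break the column layout, which plain string concatenation cannot guarantee.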

gommb