I'm trying to parse a WhatsApp chat log using regex. My solution works for most cases, but I'd like to improve it and don't know how, since I'm quite new to regex.

The chat.txt file looks like this:

[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::

My solution parses most of these messages correctly, but I have a few hundred cases where the message starts with a colon, like the last example above. In those cases the sender is captured as Person Two: : instead of Person Two.

Here is the regex I am working with so far:

import re

pattern = re.compile(
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s'
    r'(?P<time>\d{2}:\d{2}:\d{2})]\s'
    r'(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s'
    r'(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)'
)

Any advice on how I could work around this bug would be appreciated!

Tomerikoo
Philipp K
  • I can't see a problem with your current case. [`Person Two` is captured as `sender`](https://regex101.com/r/rhdbqw/2), isn't it expected? What is expected and why? – Wiktor Stribiżew Mar 08 '19 at 16:43
  • As I said, the problem only occurs when there is an additional colon after the first colon which leads to the 'sender' output to be "Person Two: :" – Philipp K Mar 08 '19 at 20:01
  • But isn't it OK? `Person Two` *is* the sender. What is the expected output for the given string? – Wiktor Stribiżew Mar 08 '19 at 20:03
  • It's just this second colon that gets added into the sender column. So for example I would have three rows in the sender-column that say "Person Two" and then another row that says "Person Two: :" which is not ideal for analyzing the data. – Philipp K Mar 08 '19 at 20:14
  • Again, you only capture `Person Two`, see closely [here](https://regex101.com/r/rhdbqw/2). No colon in Group 3. – Wiktor Stribiżew Mar 08 '19 at 20:15
  • That's weird because for some reason it does not behave like this inside my project – Philipp K Mar 08 '19 at 21:16
  • Try `(?m)^\[(?P\d{2}(?:\.\d{2}){2}),\s*(?P – Wiktor Stribiżew Mar 08 '19 at 21:20

2 Answers

I would pre-process the lines to remove the consecutive colons before applying the regex. So for each line, e.g.:

 line = "[06.12.16, 16:47:22] Person Two: ::"
 line = line.replace("::", "")

which would give:

[06.12.16, 16:47:22] Person Two: 

You can then call your regex function on the pre-processed data.

Nick
  • That would alter the message body though. The second colon is obviously a part of the message that has been sent and I don't really want to change anything there. – Philipp K Mar 08 '19 at 20:02
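
One way to fix the sender without touching the message body is to restrict the sender group to `[^:\n]+`, so that it stops at the first colon after the timestamp. This is a sketch that assumes sender names never contain a colon. As far as I can tell, the original pattern misbehaves because the greedy `.*` backtracks to the last colon on the line that is followed by whitespace, and `\s` also matches the newline at the end of a `::` line. That would explain why the problem only appears when another message follows. The names `original`, `fixed` and `rows`, and the extra fourth sample message, are made up for illustration.

```python
import re

# Pattern from the question, unchanged, kept here to reproduce the problem.
original = re.compile(
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s'
    r'(?P<time>\d{2}:\d{2}:\d{2})]\s'
    r'(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s'
    r'(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)'
)

# Sketch of a fix: the sender may not contain ':' or a newline (assumption),
# and the separator after the colon must be a space or tab, not a newline.
fixed = re.compile(
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s'
    r'(?P<time>\d{2}:\d{2}:\d{2})\]\s'
    r'(?P<sender>[^:\n]+):[ \t]'
    r'(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)'
)

# Sample from the question plus one hypothetical follow-up message.
chat = (
    "[06.12.16, 16:46:19] Person One: Wow thats amazing\n"
    "[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple\n"
    "lines as it is a very long message\n"
    "[06.12.16, 16:47:22] Person Two: ::\n"
    "[06.12.16, 16:48:00] Person One: Note: colons later in a message are fine\n"
)

original_senders = [m.group("sender") for m in original.finditer(chat)]

# Strip the trailing newline that the message group picks up at end of file.
rows = [
    {**m.groupdict(), "message": m.group("message").rstrip("\n")}
    for m in fixed.finditer(chat)
]

print(original_senders)
print([row["sender"] for row in rows])
```

On this sample the original pattern finds three matches, the last one with the sender `Person Two: :` and the following message swallowed into its body, while the tightened pattern finds all four messages with clean sender names. The message text, including the `::`, is left exactly as sent.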

I encountered similar issues when building a tool to analyze WhatsApp chats.

The main issue is that the format of chat.txt depends on your system language. In German you get a 24-hour time such as 16:47, but in English it might use AM/PM, and the date format changes for American users.

The library I used has the four regexes below. So far they have covered all the cases I encountered (Latin languages).

Filtering general:

const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;

Filter System Messages:

const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;

Date:

const regexSplitDate = /[-/.] ?/;

Handle attachments, which are wrapped in "< >" even when you export the chat without attachments (e.g. <media omitted>):

const regexAttachment = /<.+:(.+)>/;
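
Since the question uses Python, here is a rough Python port of the general pattern and the date splitter above. It is a sketch, not the library's code: `[^]` is JavaScript-only, so it is replaced with `[\s\S]`, and the names `REGEX_PARSER`, `REGEX_SPLIT_DATE`, `parse_line` and the English-locale sample line are my own assumptions.

```python
import re

# Port of the JavaScript regexParser; [^] becomes [\s\S] (any char, incl. newline).
REGEX_PARSER = re.compile(
    r"^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?"
    r"(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? "
    r"(.+?): ([\s\S]*)",
    re.IGNORECASE,
)
# Port of regexSplitDate: splits a date on '-', '/' or '.', plus an optional space.
REGEX_SPLIT_DATE = re.compile(r"[-/.] ?")


def parse_line(line):
    """Return the parsed fields of one chat entry, or None if it does not match."""
    match = REGEX_PARSER.match(line)
    if match is None:
        return None
    date, time, ampm, sender, message = match.groups()
    return {
        "date_parts": REGEX_SPLIT_DATE.split(date),
        "time": time,
        "ampm": ampm,  # None for 24-hour exports
        "sender": sender,
        "message": message,
    }


german = parse_line("[06.12.16, 16:47:22] Person Two: ::")
english = parse_line("12/6/16, 4:47 PM - Person Two: ::")  # hypothetical sample line
print(german)
print(english)
```

Because the sender group is lazy (`.+?`) and must be followed by a colon and a space, it stops at the first `: ` after the timestamp, so the `::` message from the question ends up in the message field for both sample lines.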
PKL