WhatsApp chat log parsing with regex

Question

I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'd like to improve it and don't know how, since I'm quite new to regex.

The chat.txt file looks like this:

[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::

My solution so far parses most of these messages correctly. However, I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted value of "Person Two: :" as the sender.

Here is the regex I am working with so far:

pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')

Any advice on how I could work around this bug would be appreciated!

I can't see a problem with your current case. [`Person Two` is captured as `sender`](https://regex101.com/r/rhdbqw/2), isn't it expected? What is expected and why? — Wiktor Stribiżew, Mar 08 '19 at 16:43
As I said, the problem only occurs when there is an additional colon after the first colon which leads to the 'sender' output to be "Person Two: :" — Philipp K, Mar 08 '19 at 20:01
But isn't it OK? `Person Two` *is* the sender. What is the expected output for the given string? — Wiktor Stribiżew, Mar 08 '19 at 20:03
It's just this second colon that gets added into the sender column. So for example I would have three rows in the sender-column that say "Person Two" and then another row that says "Person Two: :" which is not ideal for analyzing the data. — Philipp K, Mar 08 '19 at 20:14
Again, you only capture `Person Two`, see closely [here](https://regex101.com/r/rhdbqw/2). No colon in Group 3. — Wiktor Stribiżew, Mar 08 '19 at 20:15
That's weird because for some reason it does not behave like this inside my project — Philipp K, Mar 08 '19 at 21:16
Try `(?m)^\[(?P<date>\d{2}(?:\.\d{2}){2}),\s*(?P<time>\d{2}(?::\d{2}){2})]\s*(?P<sender>[^:]*):\s*(?P<message>.*(?:\n(?!\[\d{2}(?:\.\d{2}){2}).*)*)`, see [demo](https://regex101.com/r/rhdbqw/3). — Wiktor Stribiżew, Mar 08 '19 at 21:20
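
The pattern suggested in the last comment can be tried in Python roughly as follows. This is a minimal sketch: the group names date, time, sender and message are assumed from the question's original pattern, and the sample text is the chat excerpt from the question. The relevant change is that the sender is matched with [^:]*, so it stops at the first colon after the timestamp.

```python
import re

# Pattern from the comment above; group names are assumed to be the same
# as in the question's pattern (date, time, sender, message).
pattern = re.compile(
    r'(?m)^\[(?P<date>\d{2}(?:\.\d{2}){2}),\s*(?P<time>\d{2}(?::\d{2}){2})]\s*'
    r'(?P<sender>[^:]*):\s*'
    r'(?P<message>.*(?:\n(?!\[\d{2}(?:\.\d{2}){2}).*)*)'
)

# Sample chat from the question (no trailing newline after the last message).
chat = (
    "[06.12.16, 16:46:19] Person One: Wow thats amazing\n"
    "[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple\n"
    "lines as it is a very long message\n"
    "[06.12.16, 16:47:22] Person Two: ::"
)

# One dict per message, keyed by group name.
rows = [m.groupdict() for m in pattern.finditer(chat)]

for row in rows:
    print(repr(row["sender"]), repr(row["message"]))
```

Because the sender group cannot contain a colon, a sender name that itself includes a colon would be cut short; that trade-off may or may not matter for a given export.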

score 1 · Answer 1 · answered Mar 08 '19 at 16:07


I would pre-process the lines to remove the consecutive colons before applying the regex. So for each line, e.g.:

 line = "[06.12.16, 16:47:22] Person Two: ::"
 line = line.replace("::", "")

which would give:

[06.12.16, 16:47:22] Person Two:

You can then call your regex function on the pre-processed data.

answered Mar 08 '19 at 16:07

Nick


That would alter the message body though. The second colon is obviously a part of the message that has been sent and I don't really want to change anything there. – Philipp K Mar 08 '19 at 20:02

score 1 · Answer 2 · answered Mar 16 '21 at 13:08

I encountered similar issues when building a tool to analyze WhatsApp chats.

The main issue is that the format of chat.txt depends on your system language. In German you get a 24-hour time such as 16:47, while in English the time may carry AM/PM, and the month format changes for American users.

The library I used has the four regexes below. So far they have covered all the cases I encountered (Latin languages).

Filtering general:

const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;

Filter System Messages:

const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;

Date:

const regexSplitDate = /[-/.] ?/;

Handle attachments, which are wrapped in "< >" even when you export the chat without attachments (e.g. <media omitted>):

const regexAttachment = /<.+:(.+)>/;
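
Since the question uses Python, here is a rough Python port of the general pattern and the date splitter above. It is a sketch, not the library's code: the names REGEX_PARSER, REGEX_SPLIT_DATE and parse_message are made up for illustration, JavaScript's [^] is replaced by [\s\S] because Python's re has no direct equivalent, and the English-style sample line is a hypothetical example rather than a line from a real export.

```python
import re

# Port of the "general" pattern above. [\s\S] stands in for JavaScript's [^]
# (any character, including newlines); re.IGNORECASE replaces the /i flag.
REGEX_PARSER = re.compile(
    r"^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?"
    r"(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? "
    r"(.+?): ([\s\S]*)",
    re.IGNORECASE,
)
REGEX_SPLIT_DATE = re.compile(r"[-/.] ?")


def parse_message(text):
    """Return a dict of fields for one message, or None if it does not match."""
    match = REGEX_PARSER.match(text)
    if match is None:
        return None
    date, time, ampm, sender, message = match.groups()
    return {
        "date_parts": REGEX_SPLIT_DATE.split(date),  # order depends on locale
        "time": time,
        "ampm": ampm,  # None for 24-hour formats
        "sender": sender,
        "message": message,
    }


# Line from the question (German-style export).
german = parse_message("[06.12.16, 16:47:22] Person Two: ::")
# Hypothetical English-style line with AM/PM and a dash separator.
english = parse_message("12/6/16, 4:47 PM - Person One: Hello")
```

The sender group is lazy (.+?), so it ends at the first ": " after the timestamp, which keeps a message that starts with a colon out of the sender field. Which date part is the day and which is the month still has to be decided separately, since that depends on the export's locale.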
