I am trying to prepare a WhatsApp chat export for analysis. I need it split into three columns: time, name, and message. Some messages in the conversation contain line breaks, and when I load the file into a dataframe, the continuation lines show up as separate rows rather than as part of the message they belong to. A sample of the file:
4/16/19, 15:22 - +254 123 123: Hi my T. L
4/16/19, 15:22 - +254 123 124: Was up
4/17/19, 06:28 - member: Hi team details are Thursday 18 April,
Venue: Hilton Hotel
Time: 07:30am
Come on time team!
4/17/19, 12:17 - member: Hi guys
4/17/19, 12:18 - member: How many are coming tomorrow?
I have tried two approaches:
1. Parsing the file directly with regex, as indicated in these solutions here and here on Stack Overflow and in this blog as well
2. Indirectly, by first creating a file in which the multi-line messages are merged into one line, as found here
Both approaches have failed :( My favorite was the second approach, only because it produces a file that can also be used by other tools, e.g. Excel, Tableau...
For approach 2:
import re
from datetime import datetime
start = True
with open('test.txt', "r", encoding='utf-8') as infile, open("output.txt", "w", encoding='utf-8') as outfile:
    for line in infile:
        time = re.search(r'\d\d\/\d\d\/\d\d,.(24:00|2[0-3]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9])', line)
        sender = re.search(r'-(.*?):', line)
        if sender and time:
            date = datetime.strptime(
                time.group(),
                '%m/%d/%Y, %H:%M')
            sender = sender.group()
            text = line.rsplit(r'].+: ', 1)[-1]
            new_line = str(date) + ',' + sender + ',' + text
            if not start: new_line = new_line
            outfile.write(new_line)
        else:
            outfile.write(' ' + line)
        start = False
I am hoping to go from this:
4/17/19, 06:28 - member: Hi team details are Thursday 18 April,
Venue: Hilton Hotel
Time: 07:30am
Come on time team!
to this:
4/17/19, 06:28 - member: Hi team details are Thursday 18 April, Venue: Hilton Hotel Time: 07:30am Come on time team!
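To make the target transformation concrete, here is a minimal sketch of the merge step. It assumes every new message starts with a header shaped like `M/D/YY, HH:MM - ` (taken from the sample above; other export locales may differ). The names `HEADER`, `merge_continuations`, and `sample` are made up for illustration.

```python
import re

# Assumed header shape, based on the sample: "4/17/19, 06:28 - "
HEADER = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2} - ')

def merge_continuations(lines):
    """Fold lines that do not start with a timestamp into the previous message."""
    merged = []
    for raw in lines:
        line = raw.rstrip('\n')
        if HEADER.match(line) or not merged:
            # A header line (or the very first line) starts a new message.
            merged.append(line)
        else:
            # Anything else is treated as a continuation of the last message.
            merged[-1] += ' ' + line
    return merged

sample = [
    "4/17/19, 06:28 - member: Hi team details are Thursday 18 April,\n",
    "Venue: Hilton Hotel\n",
    "Time: 07:30am\n",
    "Come on time team!\n",
    "4/17/19, 12:17 - member: Hi guys\n",
]
for message in merge_continuations(sample):
    print(message)
```

A continuation line that itself happens to begin with a date in the same shape would be mistaken for a new message, so this is only as reliable as the header pattern.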
I would also like to load the result into a dataframe, with the datetime, member, and message columns all parsed correctly.
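For the three-column part, this is roughly what I have in mind, as a sketch only. It assumes each message already sits on a single line in the form `M/D/YY, HH:MM - name: message`, and it uses `%y` because the export has a two-digit year. `MESSAGE`, `parse_message`, and `rows` are illustrative names, not from any library.

```python
import re
from datetime import datetime

# Assumed line shape: "4/16/19, 15:22 - +254 123 123: Hi my T. L"
MESSAGE = re.compile(r'^(\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}) - ([^:]+): (.*)$')

def parse_message(line):
    """Return (datetime, name, message), or None if the line does not match."""
    match = MESSAGE.match(line.rstrip('\n'))
    if match is None:
        return None
    stamp, name, text = match.groups()
    # %y parses the two-digit year "19"; %Y expects four digits.
    return datetime.strptime(stamp, '%m/%d/%y, %H:%M'), name, text

lines = [
    "4/16/19, 15:22 - +254 123 123: Hi my T. L",
    "4/17/19, 06:28 - member: Hi team details are Thursday 18 April, Venue: Hilton Hotel Time: 07:30am Come on time team!",
]
rows = [parsed for parsed in map(parse_message, lines) if parsed is not None]
for row in rows:
    print(row)
```

Lines without a `name:` part (for example system notices) return `None` and are skipped here. With pandas installed, `rows` could then be passed to `pandas.DataFrame(rows, columns=['time', 'name', 'message'])`.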