I am trying to prepare a WhatsApp chat export for analysis. I need it divided into three columns: time, name, and message. Some messages in the conversation contain line breaks. When I load the file into a dataframe, the continuation lines of these messages show up as separate rows rather than as part of one message.

4/16/19, 15:22 - ‪+254 123 123‬: Hi my T. L

4/16/19, 15:22 - ‪+254 123 124‬: Was up

4/17/19, 06:28 - member: Hi team details are Thursday 18 April, 

Venue: Hilton Hotel

Time: 07:30am

Come on time team!

4/17/19, 12:17 - member: Hi guys

4/17/19, 12:18 - member: How many are coming tomorrow?

I have tried using two approaches:

  1. parsing directly with regex, as indicated in these solutions here and here on Stack Overflow and in this blog as well

  2. indirectly by creating a file where these multi-line messages are compiled into one line as found here

Both approaches have failed :( My favorite was the second approach, only because it produces a file that can be used by other platforms as well, e.g. Excel, Tableau...

For approach 2:

import re
from datetime import datetime

start = True

with open('test.txt', "r", encoding='utf-8') as infile, open("output.txt", "w", encoding='utf-8') as outfile:
    for line in infile:
        time = re.search(r'\d\d\/\d\d\/\d\d,.(24:00|2[0-3]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9])', line)
        sender = re.search(r'-(.*?):', line)
        if sender and time:
            date = datetime.strptime(
                time.group(),
                '%m/%d/%Y, %H:%M')
            sender = sender.group()
            text = line.rsplit(r'].+: ', 1)[-1]
            new_line = str(date) + ',' + sender + ',' + text
            if not start: new_line =  new_line
            outfile.write(new_line)
        else:
            outfile.write(' ' + line)
        start = False

I hope to finally move from getting:

4/17/19, 06:28 - member: Hi team details are Thursday 18 April, 

Venue: Hilton Hotel

Time: 07:30am

Come on time team!

and get:

4/17/19, 06:28 - member: Hi team details are Thursday 18 April, Venue: Hilton Hotel Time: 07:30am Come on time team!

Also, output it as a dataframe with the datetime, member, and message all correctly done.

Sam

1 Answer

Regular expression

You will need the following regular expression:

^(\d{1,2})\/(\d{1,2})\/(\d\d), (24:00|2[0-3]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]) - ((\S[^:]*?): )?(.*)$

Online test regex in sandbox.
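
As a rough illustration of how the pattern separates the three kinds of lines in an export, here is a minimal sketch. It uses the variant of the pattern with the optional sender part, as in the code in this answer. The helper name `classify` and its return labels are made up for this example and are not part of any library.

```python
import re

# Pattern with an optional "sender: " part, as used in the answer's code.
# Groups: 1=month, 2=day, 3=two-digit year, 4=HH:MM, 6=sender, 7=text.
MESSAGE_RE = re.compile(
    r'^(\d{1,2})\/(\d{1,2})\/(\d\d), '
    r'(24:00|2[0-3]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]) - '
    r'((\S[^:]*?): )?(.*)$'
)


def classify(line):
    """Hypothetical helper: label one line of the exported chat."""
    m = MESSAGE_RE.match(line)
    if m is None:
        # No leading timestamp: the line belongs to the previous message.
        return 'continuation'
    # A timestamp without "sender: " is assumed to be a system message.
    return 'message' if m.group(6) else 'system'
```

One caveat of this approach: a system message that happens to contain `: ` would be read as having a sender, so the `system` label is only a heuristic.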

Code

The parsed data is collected into a dictionary that is then converted to a DataFrame. At the end, as an example, the DataFrame is saved to a CSV file.

import re
from datetime import datetime
import pandas as pd

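# Groups: 1=month, 2=day, 3=two-digit year, 4=HH:MM, 6=sender (optional), 7=text
pattern = re.compile(r'^(\d{1,2})\/(\d{1,2})\/(\d\d), (24:00|2[0-3]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]) - ((\S[^:]*?): )?(.*)$')

with open('test.txt', "r", encoding='utf-8') as infile:
    outputData = { 'date': [], 'sender': [], 'text': [] }
    for line in infile:
        matches = pattern.match(line)
        if matches:
            # The line starts with a timestamp, so a new message begins here.
            outputData['date'].append(
                datetime(
                    int(matches.group(3)) + 2000,
                    int(matches.group(1)),
                    int(matches.group(2)),
                    hour=int(matches.group(4)[0:2]),
                    minute=int(matches.group(4)[3:])
                ))
            # System messages have no sender part.
            outputData['sender'].append(matches.group(6) or "{undefined}")
            outputData['text'].append(matches.group(7))

        elif len(outputData['text']) > 0:
            # No timestamp: append the line to the previous message.
            # rstrip('\n') also works if the last line has no trailing newline.
            outputData['text'][-1] += "\n" + line.rstrip('\n')

    outputData = pd.DataFrame(outputData)
    # pandas 1.5 renamed line_terminator to lineterminator; use the old name on older versions.
    outputData.to_csv('output.csv', index=False, lineterminator='\n', encoding='utf-8')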

Online test full in sandbox.
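
The question asks for multi-line messages to end up on a single line. A minimal sketch of that variation follows, assuming continuation lines should be joined with a space and blank lines skipped. The function name `parse_chat`, the `joiner` parameter, and the simplified pattern are choices made for this illustration, not part of the code above. Note that the dates in the sample use two-digit years, so the `strptime` format needs `%y` rather than `%Y`.

```python
import re
from datetime import datetime

# Simplified variant of the pattern: timestamp, optional sender, text.
MESSAGE_RE = re.compile(
    r'^(\d{1,2}\/\d{1,2}\/\d\d, [0-2][0-9]:[0-5][0-9]) - (?:(\S[^:]*?): )?(.*)$'
)


def parse_chat(lines, joiner=' '):
    """Hypothetical helper: turn exported chat lines into a list of row dicts."""
    rows = []
    for raw in lines:
        line = raw.rstrip('\n')
        m = MESSAGE_RE.match(line)
        if m:
            # %y parses the two-digit year used in the export (19 -> 2019).
            stamp = datetime.strptime(m.group(1), '%m/%d/%y, %H:%M')
            rows.append({
                'date': stamp,
                'sender': m.group(2),  # None for system messages
                'text': m.group(3).strip(),
            })
        elif rows and line.strip():
            # Continuation line: fold it into the previous message.
            rows[-1]['text'] += joiner + line.strip()
    return rows
```

The returned list of dictionaries can be passed to `pd.DataFrame(rows)` to get the `date`, `sender`, and `text` columns, and the function can be fed an open file object directly since it only iterates over lines.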

  • Hey Uriy, you are a rockstar! There's only one issue. I have a line that ends with a bracket `file.jpg (file attached)`. Somehow, it picks up the next line and both show up in the same message like `file.jpg (file attached) 4/22/19, 09:30 - Person changed this group's icon`. I'm suspecting that it has something to do with the regex... – Sam Aug 19 '19 at 06:40
  • Hey, @Sam! This is because there is no defined sender in the line `4/22/19, 09:30 - Person changed this group's icon`. This seems to be some kind of system message. You can mark it with `sender = {undefined}`. I will correct the regular expression and algorithm. – Uriy MerkUriy Aug 19 '19 at 07:31
  • The answer has been updated to handle the new example message strings. – Uriy MerkUriy Aug 19 '19 at 07:44