I'm trying to find the words I use most often in a WhatsApp chat exported to a .txt file. This code works...

from collections import Counter
import re
words = re.findall(r'\w+', open('chat.txt').read().lower())
print(Counter(words).most_common(10))

...however, it's including all the dates as well as my own name and the recipient's name. What can I add so it ignores certain words? (I'm sure there's a very simple solution, but I'm very new to Python.) Thanks!

Edit:

I now understand that I didn't explain my question very well. I realised I can't be very specific, as I have mainly been copying code examples and experimenting with what works rather than analysing the code itself.

I'm trying to find the most common word in a .txt file that is an archived WhatsApp chat. A slightly boring example:

"[06/12/2017, 18:09:10] Name1 Surname1: just on the tube now

[06/12/2017, 18:09:29] Name1 Surname1: takes me like 25 mins so I’m gunna be cutting it fine

[06/12/2017, 18:36:16] Name2 Surname2: I’m just waiting by platform 11

[16/12/2017, 00:06:34] Name2 Surname2: My message isn’t sending

[16/12/2017, 00:10:55] Name1 Surname1: ?

[16/12/2017, 00:11:14] Name1 Surname1: for some reason these have only just come through"

In the first edit of this post using the code above, this was the result:

[('2018', 8552), ('name1', 6753), ('surname1', 6625), ('02', 4520), ('03', 3810), ('i', 3322), ('you', 2275), ('name2', 2016), ('01', 1995), ('surname2', 1991)]

So it was including the dates and names, which I want to exclude.

This code however:

from collections import Counter

with open('_chat.txt') as fin:
    counter = Counter(fin.read().strip().split())

print(counter.most_common(10))

Doesn't include numbers. However, it does still include a few unwanted words like the names and "meaningless" words like 'the' and 'and':

[('Name1', 6686), ('Surname1:', 6615), ('I', 2277), ('Name2', 2000), ('Surname2:', 1990), ('you', 1714), ('to', 1488), ('and', 1084), ('a', 885), ('the', 881)]

What can I add to this to remove these kinds of words?

I understand this is similar to "How do I remove entries within a Counter object with a loop without invoking a RuntimeError?", but when I've tried to format my code similarly it hasn't been successful, and I'm a little confused by how that works too. (Sorry for being dense; when I say I'm very new to Python, I mean very, very new.)

PEOlhc
  • Possible duplicate of [How do I remove entries within a Counter object with a loop without invoking a RuntimeError?](https://stackoverflow.com/questions/7154312/how-do-i-remove-entries-within-a-counter-object-with-a-loop-without-invoking-a-r) – Anton vBR Apr 01 '18 at 22:08
  • Have you tried a more refined regex? `\w+` is too broad. – Gomes J. A. Apr 01 '18 at 22:09
  • You could try to store Counter(words) to a variable. Apply loop (follow the dupe) to remove undesired elements and lastly get the most common. – Anton vBR Apr 01 '18 at 22:09
  • We can't help you much here since we do not know why you are using `\w+`, what format the dates are in, nor the sample input. Please edit the question so that it could be answerable. – Wiktor Stribiżew Apr 01 '18 at 22:34
  • I've edited the question, not sure if that makes any more sense. Sorry for the confusion :-( – PEOlhc Apr 03 '18 at 12:11
  • If the text is really formatted with newlines as you've shown, I would suggest looping over the lines and using `defaultdict` like `for line in lines: for word in line.split(): defaultdict[word] += 1`, except you skip over the first 4 words. – Nimitz14 Apr 03 '18 at 12:14
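The suggestion to skip the first four words of each line can be sketched roughly as follows. This is only an illustration under some assumptions: every line starts with a date, a time and a two-word sender name (as in the sample above), and `PREFIX_TOKENS` and the sample `lines` are made up for the example. A `Counter` is used here, but a `defaultdict(int)` would work the same way.

```python
from collections import Counter

# Sample lines in the format shown in the question (names are placeholders).
lines = [
    "[06/12/2017, 18:09:10] Name1 Surname1: just on the tube now",
    "[06/12/2017, 18:36:16] Name2 Surname2: I'm just waiting by platform 11",
]

# Assumption: date, time, first name and "surname:" are always the first 4 tokens.
PREFIX_TOKENS = 4

counts = Counter()
for line in lines:
    tokens = line.split()
    # Skip the timestamp and sender tokens, count the rest in lower case.
    counts.update(token.lower() for token in tokens[PREFIX_TOKENS:])

print(counts.most_common(3))
```

If a sender's name has more or fewer than two words, the fixed count of four will be off for that sender, so this only suits chats where the prefix has a constant shape.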

1 Answer

Looking at your input I'd recommend cleaning it before you put it into the Counter.

If you have a file with lines looking like this:

[06/12/2017, 18:09:10] Name1 Surname1: just on the tube now

Then you can clean off the date by finding the first closing ] and keeping only what comes after it, then clean off the name by doing something similar for the first :. The lines in the file can be read using file.readlines() and then each one processed, e.g.

from collections import Counter

with open('chat.txt') as f:
    lines = f.readlines()

def clean_line(line):
    """
    Find the first ], assume it's followed by a space, and keep
    everything after that. Then split on the first : and take
    the second part of the resulting list (the message text).
    """
    return line[line.find(']') + 2:].split(':', 1)[1]

words = []
for line in lines:
    # Lower-case the message and split it into words
    words += clean_line(line).lower().split()

counted_words = Counter(words)
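The question also asks about leaving out filler words such as 'the' and 'and'. One way to do that is to filter the word list before counting. This is a minimal sketch, assuming a hand-picked set of words to ignore; the `IGNORED` set and the sample `words` list are illustrative stand-ins, not part of the code above.

```python
from collections import Counter

# Hypothetical exclusion set: extend it with whatever filler words
# (or leftover names) show up in your own results.
IGNORED = {"the", "and", "a", "to", "i", "you"}

# Stand-in for the cleaned, lower-cased word list built from the chat file.
words = ["just", "on", "the", "tube", "now", "the", "tube", "and", "you"]

# Count only the words that are not in the exclusion set.
counted_words = Counter(word for word in words if word not in IGNORED)
print(counted_words.most_common(10))
```

Using a set for `IGNORED` keeps each membership check cheap, which helps when the chat has many thousands of words.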
grahamlyons
  • Ah I see, this makes more sense thank you. I've tried adding this and I'm receiving a syntax error saying that the " words = [clean_line(line).lower().split() for line in lines)" is invalid? Also where do I then reintroduce the counter? I'm not sure how to structure the code as a whole. ((((Sorry again for being dim)))) – PEOlhc Apr 03 '18 at 19:12
  • No need to apologise. I made a syntax error, which I fixed, but I shouldn't have done a list comprehension because we end up with a list of lists which just has to be flattened. Instead I've done it as a simple loop and appended to another variable, then added the Counter as an example. Hope that helps. – grahamlyons Apr 03 '18 at 23:22
  • This is still not working for me, unfortunately. I'm getting "IndexError: list index out of range". – PEOlhc Apr 10 '18 at 12:19
  • It's hard to tell exactly what that error refers to without seeing your code and the data it's operating on. I'd guess it must be a line without a `:` character, so the `.split(':', 1)` produces a list with 1 item and trying to access the second one (`[1]`) fails. You can `try/except` the `words += clean_line...` part and print out the failing lines in the `except` block. (Hope that makes sense). – grahamlyons Apr 11 '18 at 11:43
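The `try/except` idea from the last comment might look something like this. It is a sketch only: the sample `lines` and the `skipped` list are invented for illustration, and the guess that blank lines or continuation lines of multi-line messages are the culprits is an assumption, since the actual file isn't shown.

```python
from collections import Counter

def clean_line(line):
    # Same logic as the answer: drop the "[date, time] " prefix, then the "Name:" part.
    return line[line.find(']') + 2:].split(':', 1)[1]

# Illustrative input: one normal message plus two lines that have no ':'.
lines = [
    "[16/12/2017, 00:06:34] Name2 Surname2: My message isn't sending\n",
    "\n",  # blank line
    "a continuation line without a timestamp\n",
]

words = []
skipped = []
for line in lines:
    try:
        words += clean_line(line).lower().split()
    except IndexError:
        # No ':' was found, so split() returned a single-item list.
        skipped.append(line)

print(Counter(words).most_common(10))
print("Skipped lines:", skipped)
```

Note that this only catches lines with no `:` at all. A continuation line that happens to contain a colon would not raise an error and would be split in the wrong place, so it is worth looking through the skipped lines to see what the file really contains.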