I'm trying to find the most common words in a WhatsApp chat exported to a .txt file. This code works...
from collections import Counter
import re
words = re.findall(r'\w+', open('chat.txt').read().lower())
print(Counter(words).most_common(10))
...however, it's including all the dates as well as my own name and the recipient's name. What can I add so it ignores certain words? (I'm sure there's a very simple solution, but I'm very new to Python.) Thanks!
Edit:
I now understand that I didn't explain my question very well. I realise I can't be very specific, as I have mainly been copying code examples and experimenting with what works rather than analysing the code itself.
I'm trying to find the most common word in a .txt file that is an archived WhatsApp chat. A rather boring example:
"[06/12/2017, 18:09:10] Name1 Surname1: just on the tube now
[06/12/2017, 18:09:29] Name1 Surname1: takes me like 25 mins so I’m gunna be cutting it fine
[06/12/2017, 18:36:16] Name2 Surname2: I’m just waiting by platform 11
[16/12/2017, 00:06:34] Name2 Surname2: My message isn’t sending
[16/12/2017, 00:10:55] Name1 Surname1: ?
[16/12/2017, 00:11:14] Name1 Surname1: for some reason these have only just come through"
In the first edit of this post using the code above, this was the result:
[('2018', 8552), ('name1', 6753), ('surname1', 6625), ('02', 4520), ('03', 3810), ('i', 3322), ('you', 2275), ('name2', 2016), ('01', 1995), ('surname2', 1991)]
So it was including the dates and names, which I want to exclude.
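To make the goal concrete, here is a minimal sketch of stripping the timestamp and sender from each line before counting. It assumes every message line follows the layout in the sample above ([DD/MM/YYYY, HH:MM:SS] Sender: text); the names PREFIX and message_text are made up for illustration, and real exports may differ (other date formats, invisible characters, multi-line messages).

```python
import re

# Assumed line layout, taken from the sample above:
#   [DD/MM/YYYY, HH:MM:SS] Sender Name: message text
# PREFIX and message_text are hypothetical names, not part of any library.
PREFIX = re.compile(r'^\[\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}\] [^:]+: ')

def message_text(line):
    """Return the line with its leading timestamp and sender removed.

    Lines that do not match the assumed layout are returned unchanged
    (apart from stripping surrounding whitespace).
    """
    return PREFIX.sub('', line.strip(), count=1)
```

Applied to the first sample line, this should leave only "just on the tube now", so neither the date nor the name reaches the word count.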
This code however:
from collections import Counter
with open('_chat.txt') as fin:
    counter = Counter(fin.read().strip().split())
print(counter.most_common(10))
Doesn't show any numbers in the top ten. However, it still includes a few unwanted words, such as the names and "meaningless" words like 'the' and 'and':
[('Name1', 6686), ('Surname1:', 6615), ('I', 2277), ('Name2', 2000), ('Surname2:', 1990), ('you', 1714), ('to', 1488), ('and', 1084), ('a', 885), ('the', 881)]
What can I add to this to remove these kinds of words?
I understand this is similar to "How do I remove entries within a Counter object with a loop without invoking a RuntimeError?", but when I've tried to adapt my code along those lines it hasn't worked, and I'm a little confused by how that works too. (Sorry for being dense; when I say I'm very new to Python, I mean very, very new.)
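For what it's worth, this is roughly the behaviour I'm after, as a rough sketch rather than a known-good solution: skip unwanted words while building the Counter instead of deleting them afterwards, which sidesteps the RuntimeError issue entirely. The IGNORE set and the count_words name are placeholders I made up; the set would need the real names and whatever "meaningless" words should be dropped.

```python
from collections import Counter
import re

# Hypothetical ignore list: replace with the real names and any
# filler words that should not be counted.
IGNORE = {'name1', 'surname1', 'name2', 'surname2', 'the', 'and', 'a', 'to'}

def count_words(text, ignore=IGNORE):
    """Count lowercase words in text, skipping numbers and ignored words."""
    # Keep apostrophes inside tokens so "I’m" stays one word.
    tokens = re.findall(r"[\w'’]+", text.lower())
    # Trim stray leading/trailing apostrophes left over from quoting.
    words = (t.strip("'’") for t in tokens)
    # Drop empty strings, pure-digit tokens (dates, times) and ignored words.
    return Counter(w for w in words
                   if w and not w.isdigit() and w not in ignore)
```

Assuming the export is UTF-8, it could be used as count_words(open('_chat.txt', encoding='utf-8').read()).most_common(10). Because the text is lowercased first, the entries in the ignore set need to be lowercase too.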