Please suggest improvements for fuzzy matching email header string values with Python

Question

I'm currently trying to match two values found in the From header of an email: the sender name and the email_id. To illustrate, here is an example of this header's content:

"Surname Lastname" <surname.lastname@company_domain.co.uk>

So the two parts I'm trying to match are:

Sender name: Surname Lastname

and

Email_ID: surname.lastname@company_domain.co.uk
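For anyone who wants to reproduce the split, here is a minimal sketch using `email.utils.parseaddr` from the standard library. This is only an illustration and not how my workflow extracts the values (that is done with regexp); the variable names are made up for the example.

```python
from email.utils import parseaddr

# Example header value from above; raw_from is a hypothetical variable name.
raw_from = '"Surname Lastname" <surname.lastname@company_domain.co.uk>'

# parseaddr returns a (display name, address) tuple and strips the quotes
# and angle brackets. It returns ('', '') if the value cannot be parsed.
sender_name, email_id = parseaddr(raw_from)

print(sender_name)
print(email_id)
```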

For context, this is part of a larger automated Python workflow that is meant to detect spam based on several different criteria. Anyone who has worked with spam detection and email filtering knows that email header values come in widely different formats (something that is beginning to vex me greatly). Below are some mildly modified examples of From headers I use as test data. The percentages were taken from similar test data, so there may be small differences if you run these exact strings, but I've made sure to maintain lengths and character sets, so they should be close enough.

"Angelina Wolfe" <angelina.wolfe@luxatiainternational.net>   # (81%)
"David.se" <david@datatracks.se>                             # (55%)
"ZoomInfo Notification" <noreply@m.zoominfo-privacy.com>     # (50%)
"jackie.cobin2015@yahoo.in" <jackie.cobin2015@yahoo.in>      # (100%)
"Golgin Gurlukovich" <gg@acorn.ru>                           # (31%)

These are all valid matches (although some could be considered spam). Getting a 100% match for each example is proving very difficult, so I'm aiming for a close match (~70%) using a Python library called fuzzywuzzy. My code currently looks like this:

from fuzzywuzzy import fuzz
# Data is extracted earlier with regexp from EML files.
# The sender and email_id will always be from the same EML file here so no mismatch is possible.   
tmp_sender = self.headers['sender'].lower().strip()         # sender name
tmp_emailID = self.headers['email_id'].lower().strip()      # email_id

sender_fuzz_ratio = fuzz.partial_ratio(tmp_sender, tmp_emailID) # Fuzzywuzzy confidence calculation

if tmp_sender in tmp_emailID:  # Naive check if sender matches email_id (substring test also covers equality)
    self.verdict['Sender_fields_check'] = "Sender name '{}' matched email_id '{}'".format(self.iocs['sender'], self.iocs['email_id'])
elif sender_fuzz_ratio >= 70:   # Fuzzy check if sender matches email_id # TODO: Tweak me if needed.
    self.verdict['Sender_fields_check'] = "Sender name {} is a probable match for email_id {}, fuzz ratio confidence: {}%".format(self.iocs['sender'], self.iocs['email_id'], sender_fuzz_ratio)
else:
    self.verdict['Sender_fields_check'] = "Sender name {} does probably not match email_id {}, fuzzy match confidence: {}%".format(self.iocs['sender'],self.iocs['email_id'], sender_fuzz_ratio)
    self.verdict['Final_verdict'] = "Spam" # If it does not look like a match then we classify as spam.
    return self.verdict

So basically, it's not good enough in its current state, and I was hoping for suggestions on how to improve the confidence of the match. While it's important that the matches are correct, it's just as important that it does not match too leniently.

One way would be to choose a different fuzzywuzzy algorithm, but I'm not sure which one would strike the best balance in the long run. Another option would be to combine several fuzzywuzzy checks and base the verdict on an average, maybe... Or, if there is a completely different and better way to do this, please share; I would love to know more. Any and all suggestions for improvement are welcome, and if I've been unclear, please let me know and I'll try to clarify.
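To make the "combine several checks" idea more concrete, here is a rough sketch of what I have in mind. It is only an assumption about how this could look, not working code from my project: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy so that it runs without extra installs, and the function names, the four signals, and the minimum token length of 4 are all made up for illustration.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def _ratio(a, b):
    """Similarity from 0 to 100 via difflib (a stand-in, not fuzzywuzzy)."""
    if not a or not b:
        return 0  # never treat two empty strings as a match
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sender_match_scores(sender_name, email_id):
    """Return several independent 0-100 scores for one name/address pair."""
    local_part, _, domain = email_id.lower().partition("@")
    name_tokens = _tokens(sender_name)
    local_joined = "".join(_tokens(local_part))
    domain_tokens = _tokens(domain)
    initials = "".join(t[0] for t in name_tokens)

    return {
        # Whole display name against the whole address.
        "full": _ratio(sender_name.lower().strip(), email_id.lower().strip()),
        # Name against the part before the "@", separators ignored.
        "local": _ratio("".join(name_tokens), local_joined),
        # Longer name tokens against domain labels (company-style senders).
        # The length cut-off of 4 is an arbitrary assumption to skip "se", "com".
        "domain": max(
            (_ratio(n, d) for n in name_tokens if len(n) >= 4
             for d in domain_tokens),
            default=0,
        ),
        # Local part consisting only of the name's initials, e.g. "gg".
        "initials": 100 if initials and initials == local_joined else 0,
    }


def combined_score(sender_name, email_id):
    """Hypothetical combination rule: take the strongest single signal."""
    return max(sender_match_scores(sender_name, email_id).values())


if __name__ == "__main__":
    for name, addr in [
        ("Angelina Wolfe", "angelina.wolfe@luxatiainternational.net"),
        ("ZoomInfo Notification", "noreply@m.zoominfo-privacy.com"),
        ("Golgin Gurlukovich", "gg@acorn.ru"),
    ]:
        print(name, addr, sender_match_scores(name, addr))
```

Taking the maximum is the most lenient way to combine the signals, which worries me given that I don't want to match too leniently. An average, a weighted sum, or requiring two signals above a threshold would be stricter, and I don't know which of these holds up best in practice.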

Thanks for reading my question, cheers!
