I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this case, as some legitimate account names consist entirely of upper-case or entirely of lower-case letters.
Disclaimer: this is just an internal research experiment, and no real action will be taken based on the classifier's output.
In my particular case, there are 2 possible characteristics that can reveal an account name as suspicious, gibberish, or both:
1. The name has weird/random spelling, or it consists purely or mostly of numbers. Examples of account names that fit this criterion: 128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds.
2. The name has 2 components (let's assume that no name will ever have more than 2 components), and the spelling and pronunciation of the 2 components are very similar. Examples of account names that fit this criterion: Jala Haja, Hata Yaha, Faja Kaja.
If an account name meets both of the above criteria (e.g. 'asdfs lsdfs', '332 333'), it should also be considered suspicious.
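To make the first criterion a bit more concrete, here is a minimal sketch of only the "purely or mostly numbers" part. The function names and the 0.5 threshold are assumptions made for illustration, not something I have tuned on real data, and the sketch does nothing about the weird/random spelling part.

```python
def digit_fraction(name: str) -> float:
    """Return the share of non-space characters in `name` that are digits."""
    chars = [c for c in name if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def is_mostly_numbers(name: str, threshold: float = 0.5) -> bool:
    """Hypothetical check: True when at least `threshold` of the characters are digits."""
    return digit_fraction(name) >= threshold
```

With this threshold, names such as 128, t66 and '332 333' are flagged, while tt and asdfds are not, so the random-spelling cases still need a different approach.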
On the other hand, a legitimate account name doesn't need to have both a first name and a last name. Legitimate names usually come from widely used languages: those written in the Latin alphabet (e.g. Spanish, German, Portuguese, French, English), as well as Chinese and Japanese.
Examples of legitimate account names (these are made up, but they reflect the style of legitimate account names in the real world): Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng.
I've seen some slightly similar questions on Stack Overflow that ask for ways to detect gibberish text. But those don't fit my situation, because legitimate text and words actually have meanings, whereas human names usually don't. I also want to be able to do this based on the account name alone and nothing else.
Right now my script detects the 2nd characteristic of suspicious account names (similar components in the name) using Python's FuzzyWuzzy package, with 50% as the similarity threshold. The script is listed below:
    from fuzzywuzzy import fuzz
    import pandas as pd
    import numpy as np

    accounts = pd.read_csv('dataset_with_names.csv', encoding = 'ISO-8859-1', sep=None, engine='python').replace(np.nan, 'blank', regex=True)
    pd.options.mode.chained_assignment = None
    accounts.columns = ['name', 'email', 'akon_id', 'acct_creation_date', 'first_time_city', 'first_time_ip', 'label']

    # Strip punctuation and lower-case the name before comparing its components.
    # regex=True is needed on recent pandas versions, where str.replace is literal by default.
    accounts['name_simplified'] = accounts['name'].str.replace(r'[^\w\s]', '', regex=True)
    accounts['name_simplified'] = accounts['name_simplified'].str.lower()

    sim_name = []
    for index, row in accounts.iterrows():
        # Split into components; single-component names are never flagged.
        parts = row['name_simplified'].split()
        # Flag the name when its first two components are at least 50% similar.
        if len(parts) > 1 and fuzz.ratio(parts[0], parts[1]) >= 50:
            sim_name.append('True')
        else:
            sim_name.append('False')
    accounts['are_name_components_similar'] = sim_name
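For anyone who wants to try the component-similarity check without my dataset, here is a self-contained sketch of the same idea as a single function. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may not match FuzzyWuzzy exactly; the function name `components_similar` is hypothetical, and it returns a real boolean rather than the 'True'/'False' strings used in the script.

```python
import re
from difflib import SequenceMatcher


def components_similar(name: str, threshold: int = 50) -> bool:
    """Return True when the first two components of `name` are similar enough."""
    # Same clean-up as the script: drop punctuation, lower-case, split on whitespace.
    parts = re.sub(r'[^\w\s]', '', name).lower().split()
    if len(parts) < 2:
        return False
    # SequenceMatcher.ratio() is in [0, 1]; scale to 0-100 to match the 50% threshold.
    score = round(100 * SequenceMatcher(None, parts[0], parts[1]).ratio())
    return score >= threshold
```

It could be applied to the data frame with something like `accounts['name'].apply(components_similar)`.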
The results have been reliable for what the script was designed to do, but I also want to be able to surface gibberish account names with the 1st characteristic (weird/random spelling, or a name that consists purely or mostly of numbers). I have not found a solution for that yet.
Can anyone help? Any feedback/suggestion will be greatly appreciated!