Unable to detect gibberish names using Python

Question

I am trying to build a Python model that classifies account names as either legitimate or gibberish. Capitalization is not important in this case, as some legitimate account names consist of all upper-case or all lower-case letters.

Disclaimer: this is just an internal research experiment, and no real action will be taken based on the classifier's output.

In my particular case, there are 2 characteristics that can reveal an account name as suspicious, gibberish, or both:

Weird/random spelling, or the name consists purely or mostly of numbers. Examples of account names that fit this criterion are: 128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds.
The name has 2 components (let's assume that no name will ever have more than 2 components), and the spelling and pronunciation of the 2 components are very similar. Examples of account names that fit this criterion are: Jala Haja, Hata Yaha, Faja Kaja.

If an account name meets both of the above criteria (e.g. 'asdfs lsdfs', '332 333'), it should also be considered suspicious.

On the other hand, a legitimate account name doesn't need to have both a first name and a last name. Legitimate names usually come from widely used languages, such as those written in the Latin alphabet (e.g. Spanish, German, Portuguese, French, English), Chinese, and Japanese.

Examples of legitimate account names include (these names are made up but do reflect similar styles of legitimate account names in real world): Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng.

I've seen some slightly similar questions on Stack Overflow that ask for ways to detect gibberish text. But those don't fit my situation, because legitimate text and words actually have meanings, whereas human names usually don't. I also want to be able to do it based on account names alone and nothing else.

Right now my script detects the 2nd characteristic of suspicious account names (similar components in the name) using Python's FuzzyWuzzy package, with 50% as the similarity threshold. The script is listed below:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import pandas as pd
import numpy as np

accounts = pd.read_csv('dataset_with_names.csv', encoding = 'ISO-8859-1', sep=None, engine='python').replace(np.nan, 'blank', regex=True)

pd.options.mode.chained_assignment = None

accounts.columns = ['name', 'email', 'akon_id', 'acct_creation_date', 'first_time_city', 'first_time_ip', 'label']

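# Strip punctuation; regex=True is required in recent pandas, where str.replace treats the pattern literally by default
accounts['name_simplified']=accounts['name'].str.replace(r'[^\w\s]', '', regex=True)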
accounts['name_simplified']=accounts['name_simplified'].str.lower()

sim_name = []

for name in accounts['name_simplified']:
    parts = name.split()
    # Only names with at least two components can be flagged;
    # the first two components are compared with a 50% threshold.
    similar = len(parts) > 1 and fuzz.ratio(parts[0], parts[1]) >= 50
    # Store 'True'/'False' as strings, as in the original column
    sim_name.append(str(similar))

accounts['are_name_components_similar'] = sim_name

The result has been reliable for what the script was designed to do, but I also want to be able to surface gibberish account names with the 1st characteristic (weird/random spelling or name consists of purely or mostly numbers). So far I have not found a solution to that yet.

Can anyone help? Any feedback/suggestion will be greatly appreciated!

David Dale · Accepted Answer · 2021-02-14T16:08:13.137

For the 1st characteristic, you can train a character-based n-gram language model, and treat all names with low average per-character probability as suspicious.

A quick-and-dirty example of such a language model is below. It is a mixture of 1-gram, 2-gram and 3-gram language models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of all names of actors).

from nltk.corpus import brown
from collections import Counter
import numpy as np

text = '\n  '.join([' '.join([w for w in s]) for s in brown.sents()])

unigrams = Counter(text)
# range(len(text)-1) and range(len(text)-2) so that the final bigram and trigram are counted too
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-1))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-2))

weights = [0.001, 0.01, 0.989]

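def strangeness(text):
    """Average negative log-probability per character; higher means more unusual."""
    r = 0
    # Pad with two spaces so the first character has a 2-character context,
    # and end with a newline, matching the layout of the training text.
    text = '  ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        # Weighted n-gram counts over weighted counts of their (n-1)-gram prefixes.
        # Note: num is 0 (and the result is inf) if char never occurs in the training text.
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)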

Now you can apply this strangeness measure to your examples.

t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
for t in t1 + t2:
    print('{:20} -> {:9.5}'.format(t, strangeness(t)))

You can see that gibberish names are in most cases more "strange" than normal ones. You could use, for example, a threshold of 3.9 here.

128                  ->    5.5528
127                  ->    5.6572
h4rugz4sx383a6n64hpo ->    5.9016
tt                   ->    4.9392
t66                  ->    6.9673
t65                  ->    6.8501
asdfds               ->    3.9776
Michael              ->    3.3598
sara                 ->    3.8171
jose colmenares      ->    2.9539
Dimitar              ->    3.4602
Jose Rafael          ->    3.4604
Morgan               ->    3.3628
Eduardo Medina       ->    3.2586
Luis R. Mendez       ->     3.566
Hikaru               ->    3.8936
SELENIA              ->    6.1829
Zhang Ming           ->    3.4809
Xuting Liu           ->    3.7161
Chen Zheng           ->    3.6212
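
SELENIA scores 6.1829 above, most likely because the model is case-sensitive and the training text is mostly lower-case. Since capitalization is not important in the question, one option is to lower-case both the training text and the scored name. The following is a minimal self-contained sketch of that idea, not a tuned implementation: it trains the same 1/2/3-gram mixture on a tiny hypothetical name list (`names`) instead of the Brown corpus, and adds an assumed `floor` parameter so that characters never seen in training (digits, here) give a large finite score instead of infinity.

```python
from collections import Counter
import math

# Hypothetical training list; a real one would hold many thousands of names.
names = ["michael", "sara", "jose colmenares", "dimitar", "morgan",
         "eduardo medina", "luis mendez", "hikaru", "selenia", "zhang ming"]

# One name per line, lower-cased, each preceded by two spaces of context.
text = "  " + "\n  ".join(n.lower() for n in names) + "\n"

unigrams = Counter(text)
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
total = sum(unigrams.values())
weights = [0.001, 0.01, 0.989]


def strangeness(name, floor=1e-6):
    """Average negative log-probability per character, case-insensitive."""
    s = "  " + name.lower() + "\n"
    r = 0.0
    for i in range(2, len(s)):
        char, context1, context2 = s[i], s[i - 1:i], s[i - 2:i]
        num = (unigrams[char] * weights[0]
               + bigrams[context1 + char] * weights[1]
               + trigrams[context2 + char] * weights[2])
        den = (total * weights[0]
               + unigrams[context1] * weights[1]
               + bigrams[context2] * weights[2])
        # floor keeps the log finite when a character was never seen in training
        r -= math.log(max(num / den, floor))
    return r / (len(s) - 2)


for name in ["SELENIA", "selenia", "michael", "128", "h4rugz4sx383a6n64hpo"]:
    print("{:22} -> {:.4f}".format(name, strangeness(name)))
```

With a list this small the absolute scores mean little, so any threshold would have to be re-chosen on real data.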

Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
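
A minimal sketch of the lookup idea, under stated assumptions: `KNOWN_NAMES` is a tiny hypothetical set standing in for real name lists, `unknown_fraction` is an illustrative helper, and single-letter tokens are ignored so that initials such as "R." do not count against a name.

```python
import re

# Hypothetical whitelist; in practice it would be loaded from name lists
# for each target language.
KNOWN_NAMES = {
    "michael", "sara", "jose", "colmenares", "dimitar", "rafael", "morgan",
    "eduardo", "medina", "luis", "mendez", "hikaru", "selenia", "zhang",
    "ming", "xuting", "liu", "chen", "zheng",
}


def unknown_fraction(name, known=KNOWN_NAMES):
    """Fraction of name components that are not found in the known-name set."""
    # Runs of letters only: digits and punctuation never form a component.
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    # Ignore single letters so that initials are not penalized.
    tokens = [t for t in tokens if len(t) > 1]
    if not tokens:
        return 1.0  # nothing name-like at all, e.g. '128' or 't66'
    return sum(t not in known for t in tokens) / len(tokens)


for name in ["Luis R. Mendez", "SELENIA", "128", "t66", "asdfds"]:
    print("{:16} -> {:.2f}".format(name, unknown_fraction(name)))
```

A name could then be flagged when the fraction exceeds a cutoff chosen on labeled data. Rare but legitimate names will be missed by any finite list, which is the trade-off against the language-model approach.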

Thank you @David. This is a great suggestion! Instead of using the words from 'brown' corpus, do you think I could use a list of legitimate names I already have to train the model? — Stanleyrr, Jun 04 '18 at 01:15
Yes, you can do this, especially if the list is large enough. — David Dale, Jun 04 '18 at 03:45
what is the reason we start from 2 instead of 0 in this function: "for i in range(2, len(text)):"? — Stanleyrr, Sep 10 '19 at 16:09
@Stanleyrr that's because I added two spaces before the text. And I added them in order to use context of length 2 for each character. — David Dale, Sep 11 '19 at 09:53
Thanks, @David Dale. That makes sense. Could you also explain what "den = sum(unigrams.values()) * weights[0] + unigrams[char] * weights[1] + bigrams[context1] * weights[2]" does? Why are we multiplying unigrams[char] and bigrams[context] by weights[1] and weights[2] instead of weights[0] and weights[1] like we did in the "num" variable? — Stanleyrr, Sep 11 '19 at 16:12
In fact, I am attempting there to calculate a "micro average" of 1-gram, 2-gram, and 3-gram language models. Each model is multiplied by its weight, and for each model I include the count of its n-grams in the numerator, and the count of the corresponding (n-1)-grams (its prefix) in the denominator. Of course, you could use a simpler model (e.g. 3-grams only), but in my experience, this approach works slightly better. — David Dale, Sep 11 '19 at 17:03
what is the reason you apply "weights[1]" to unigrams[char] in the denominator while applying "weights[0]" to unigrams[char] in the numerator? — Stanleyrr, Sep 11 '19 at 20:52
@Stanleyrr Please note that I have fixed some typos in my answer. You can view my ratio as a weighted average of three ratios: `unigrams[char] / sum(unigrams.values())`, `bigrams[context1+char] / unigrams[context1]`, and `trigrams[context2+char] / bigrams[context2]`. These ratios represent respectively unconditional frequency of a char, and its frequencies conditional on one or two previous characters. — David Dale, Feb 14 '21 at 16:09
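
For reference, the quantity computed by `strangeness` in the answer's code can be written out as follows. The notation is introduced here for illustration only: $c(\cdot)$ is an n-gram count in the training text, $N$ is the total number of characters (`sum(unigrams.values())`), $w_1, w_2, w_3$ are `weights[0]`, `weights[1]`, `weights[2]`, and $x_1 \dots x_n$ is the name followed by the terminating newline, with two padding spaces as $x_{-1}$ and $x_0$.

```latex
\hat{p}(x_i \mid x_{i-2} x_{i-1})
  = \frac{w_1\, c(x_i) + w_2\, c(x_{i-1} x_i) + w_3\, c(x_{i-2} x_{i-1} x_i)}
         {w_1\, N + w_2\, c(x_{i-1}) + w_3\, c(x_{i-2} x_{i-1})},
\qquad
\mathrm{strangeness}(x) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i \mid x_{i-2} x_{i-1})
```

Each weight multiplies an n-gram count in the numerator and the count of that n-gram's prefix in the denominator, so the three models are pooled as a ratio of weighted sums, which is the "micro average" described in the comments.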
