I built an application that suggests email address fixes, and I need to detect email addresses that are basically not real, existing email addresses, like the following:
14370afcdc17429f9e418d5ffbd0334a@magic.com
ce06e817-2149-6cfd-dd24-51b31e93ea1a@stackoverflow.org.il
87c0d782-e09f-056f-f544-c6ec9d17943c@microsoft.org.il
root@ns3160176.ip-151-106-35.eu
ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h@outlook.com
h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312@gmail.com
test@454-fs-ns-dff4-xhh-43d-frfs.com
I could run multiple regex checks, but I don't think I would catch a good percentage of the suspected 'not-real' email addresses, since each regex only targets one specific pattern.
I looked at:
Javascript script to find gibberish words in form inputs
Translate this JavaScript Gibberish please?
Detect keyboard mashed email addresses
Finally I looked over this:
Unable to detect gibberish names using Python
And it seems to fit my needs, I think: a script that will give me a score for how likely each part of the email address is to be gibberish (i.e. not a real address).
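By "each part" I mean roughly the local part plus the domain without its ending. A rough sketch of that split (just an illustration; a real implementation would need a public-suffix list to handle multi-label endings like org.il):

// Split an address into the pieces to score: the local part and the
// domain with its last label (TLD) stripped. Naive on purpose.
const partsToScore = (email) => {
    const [local, domain] = email.split('@');
    const labels = domain.split('.');
    return [local, labels.slice(0, -1).join('.')];
};

console.log(partsToScore('root@ns3160176.ip-151-106-35.eu'));
// -> ['root', 'ns3160176.ip-151-106-35']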
So what I want is to be able to do something like this:
const strings = ["14370afcdc17429f9e418d5ffbd0334a", "gmail", "ce06e817-2149-6cfd-dd24-51b31e93ea1a",
    "87c0d782-e09f-056f-f544-c6ec9d17943c", "space-max", "ns3160176.ip-151-106-35",
    "ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h", "outlook", "h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312",
    "system-analytics", "454-fs-ns-dff4-xhh-43d-frfs"];

for (let i = 0; i < strings.length; i++) {
    validateGibberish(strings[i]);
}
And this validateGibberish function's logic will be similar to this Python code:
from nltk.corpus import brown
from collections import Counter
import numpy as np

# Train character unigram/bigram/trigram counts on the Brown corpus.
text = '\n'.join([' '.join([w for w in s]) for s in brown.sents()])
unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

weights = [0.001, 0.01, 0.989]

def strangeness(text):
    """Average negative log-probability of each character given the two
    characters before it; higher means the string looks less like English."""
    r = 0
    text = ' ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        num = unigrams[char] * weights[0] + bigrams[context1 + char] * weights[1] + trigrams[context2 + char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[char] * weights[1] + bigrams[context1] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)
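In JavaScript I imagine the same idea would look roughly like this. This is only a sketch of what I mean: corpus is a placeholder I made up for some large chunk of ordinary English training text, and unseen character combinations are floored to avoid log(0):

// Sketch of a JS version of the Python strangeness() above.
// `corpus` is a placeholder for a large English training text.
const corpus = 'the quick brown fox jumps over the lazy dog'; // placeholder

const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1);

const unigrams = new Map(), bigrams = new Map(), trigrams = new Map();
for (let i = 0; i < corpus.length; i++) {
    bump(unigrams, corpus[i]);
    if (i + 1 < corpus.length) bump(bigrams, corpus.slice(i, i + 2));
    if (i + 2 < corpus.length) bump(trigrams, corpus.slice(i, i + 3));
}
const totalChars = corpus.length;

const weights = [0.001, 0.01, 0.989];
const freq = (map, key) => map.get(key) || 0;

// Average negative log-probability of each character given the two
// characters before it; higher means "more surprising", i.e. more
// gibberish-like relative to the training text.
function strangeness(text) {
    let r = 0;
    text = ' ' + text + '\n';
    for (let i = 2; i < text.length; i++) {
        const char = text[i];
        const context1 = text[i - 1];
        const context2 = text.slice(i - 2, i);
        const num = freq(unigrams, char) * weights[0]
                  + freq(bigrams, context1 + char) * weights[1]
                  + freq(trigrams, context2 + char) * weights[2];
        const den = totalChars * weights[0]
                  + freq(unigrams, char) * weights[1]
                  + freq(bigrams, context1) * weights[2];
        // Tiny floor so unseen character combinations don't produce log(0).
        r -= Math.log(Math.max(num, 1e-9) / den);
    }
    return r / (text.length - 2);
}

With a big enough training text, strangeness('gmail') should come out much lower than strangeness('454-fs-ns-dff4-xhh-43d-frfs'), similar to the scores listed below.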
So in the end I will loop over all the strings and get something like this:
"14370afcdc17429f9e418d5ffbd0334a" -> 8.9073
"gmail" -> 1.0044
"ce06e817-2149-6cfd-dd24-51b31e93ea1a" -> 7.4261
"87c0d782-e09f-056f-f544-c6ec9d17943c" -> 8.3916
"space-max" -> 1.3553
"ns3160176.ip-151-106-35" -> 6.2584
"ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h" -> 7.1796
"outlook" -> 1.6694
"h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312" -> 8.5734
"system-analytics" -> 1.9489
"454-fs-ns-dff4-xhh-43d-frfs" -> 7.7058
Does anybody have a hint on how to do this and can help?
Thanks a lot :)
UPDATE (12-22-2020)
I managed to write some code based on @Konstantin Pribluda's answer, using a Shannon entropy calculation:
// Count how many times each distinct character occurs in the string.
// Note: chr is not regex-escaped, so special characters (e.g. '.') are
// treated as regex patterns; the try/catch below guards against the
// ones that would throw.
const getFrequencies = str => {
    let dict = new Set(str);
    return [...dict].map(chr => {
        return str.match(new RegExp(chr, 'g')).length;
    });
};

// Measure the entropy of a string in bits per symbol.
const entropy = str => getFrequencies(str)
    .reduce((sum, frequency) => {
        let p = frequency / str.length;
        return sum - (p * Math.log(p) / Math.log(2));
    }, 0);

const strings = ['14370afcdc17429f9e418d5ffbd0334a', 'or', 'sdf', 'test', 'dave coperfield', 'gmail', 'ce06e817-2149-6cfd-dd24-51b31e93ea1a',
    '87c0d782-e09f-056f-f544-c6ec9d17943c', 'space-max', 'ns3160176.ip-151-106-35',
    'ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h', 'outlook', 'h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312', 'system-analytics', '454-fs-ns-dff4-xhh-43d-frfs'];

for (let i = 0; i < strings.length; i++) {
    const str = strings[i];
    let result = 0;
    try {
        result = entropy(str);
    } catch (error) {
        result = 0;
    }
    console.log(`Entropy of '${str}' in bits per symbol:`, result);
}
The output is:
Entropy of '14370afcdc17429f9e418d5ffbd0334a' in bits per symbol: 3.7417292966721747
Entropy of 'or' in bits per symbol: 1
Entropy of 'sdf' in bits per symbol: 1.584962500721156
Entropy of 'test' in bits per symbol: 1.5
Entropy of 'dave coperfield' in bits per symbol: 3.4565647621309536
Entropy of 'gmail' in bits per symbol: 2.3219280948873626
Entropy of 'ce06e817-2149-6cfd-dd24-51b31e93ea1a' in bits per symbol: 3.882021446536749
Entropy of '87c0d782-e09f-056f-f544-c6ec9d17943c' in bits per symbol: 3.787301737252941
Entropy of 'space-max' in bits per symbol: 2.94770277922009
Entropy of 'ns3160176.ip-151-106-35' in bits per symbol: 3.1477803284561103
Entropy of 'ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h' in bits per symbol: 3.3502926596166693
Entropy of 'outlook' in bits per symbol: 2.1280852788913944
Entropy of 'h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312' in bits per symbol: 3.619340871812292
Entropy of 'system-analytics' in bits per symbol: 3.327819531114783
Entropy of '454-fs-ns-dff4-xhh-43d-frfs' in bits per symbol: 3.1299133176846836
It's still not working as expected, as 'dave coperfield' gets roughly the same score as the gibberish strings.
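I think part of the problem is that per-character entropy only looks at how often each character occurs and completely ignores their order, so a real name and a random shuffle of its letters get exactly the same score. For example, using the entropy() function above (quickShuffle is just a quick-and-dirty shuffle for illustration):

// Entropy depends only on character frequencies, not character order,
// so shuffling a real name does not change its score at all.
const quickShuffle = s => [...s].sort(() => Math.random() - 0.5).join('');

console.log(entropy('dave coperfield'));               // ~3.4566, as logged above
console.log(entropy(quickShuffle('dave coperfield'))); // exactly the same value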
Anyone else have better logic or ideas on how to do it?