-1

I built an application that suggests fixes for email addresses, and I need to detect addresses that almost certainly do not belong to a real person, like the following:

14370afcdc17429f9e418d5ffbd0334a@magic.com
ce06e817-2149-6cfd-dd24-51b31e93ea1a@stackoverflow.org.il
87c0d782-e09f-056f-f544-c6ec9d17943c@microsoft.org.il
root@ns3160176.ip-151-106-35.eu
ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h@outlook.com
h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312@gmail.com
test@454-fs-ns-dff4-xhh-43d-frfs.com

I could run multiple regex checks, but I don't think that would catch a good percentage of the suspected 'not-real' email addresses, because each regex targets only one specific pattern.
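For reference, the multi-regex approach could look like the minimal sketch below. The names `UUID_RE`, `HEX32_RE` and `looksMachineGenerated` are hypothetical. It catches the UUID-shaped and 32-character hex local parts, but not the irregular ones, which is exactly the limitation described above.

```javascript
// Hypothetical patterns: a UUID-shaped string (8-4-4-4-12 hex digits)
// and a plain 32-character hex string.
const UUID_RE = /^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i;
const HEX32_RE = /^[0-9a-f]{32}$/i;

// Returns true when the local part matches one of the fixed patterns.
function looksMachineGenerated(localPart) {
    return UUID_RE.test(localPart) || HEX32_RE.test(localPart);
}

console.log(looksMachineGenerated('ce06e817-2149-6cfd-dd24-51b31e93ea1a'));
console.log(looksMachineGenerated('ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h'));
```

Every new shape of gibberish needs another pattern, so this only works as a cheap first filter.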

I looked in:
Javascript script to find gibberish words in form inputs
Translate this JavaScript Gibberish please?
Detect keyboard mashed email addresses

Finally, I looked at this: Unable to detect gibberish names using Python
It seems to fit my needs: a script that gives each part of the email address a score for how likely it is to be gibberish (that is, not a real address).

So what I want is the output to be:

const strings = ["14370afcdc17429f9e418d5ffbd0334a", "gmail", "ce06e817-2149-6cfd-dd24-51b31e93ea1a", 
                 "87c0d782-e09f-056f-f544-c6ec9d17943c", "space-max", "ns3160176.ip-151-106-35", 
                 "ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h", "outlook", "h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312",
                 "system-analytics", "454-fs-ns-dff4-xhh-43d-frfs"];

for (let i = 0; i < strings.length; i++) {
   validateGibberish(strings[i]);
}

The logic of this validateGibberish function would be similar to this Python code:

from nltk.corpus import brown
from collections import Counter
import numpy as np

text = '\n'.join([' '.join([w for w in s]) for s in brown.sents()])

unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

weights = [0.001, 0.01, 0.989]

def strangeness(text):
    r = 0
    text = '  ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2] 
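        # Each denominator term is the count of the matching context: all characters, the previous character, the previous two characters
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]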
        r -= np.log(num / den)
    return r / (len(text) - 2)
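A rough JavaScript sketch of the same idea might look like the following. It is not a faithful port: `SAMPLE_TEXT` is a tiny made-up corpus standing in for the Brown corpus (an assumption, and far too small for real use), the helper names `countNgrams` and `trainModel` are hypothetical, and a +1 is added to the unigram term so that characters missing from the corpus never produce log(0).

```javascript
// Assumption: a tiny sample corpus stands in for the Brown corpus.
const SAMPLE_TEXT = [
    'the quick brown fox jumps over the lazy dog',
    'system analytics and space research at the office',
    'john smith sends mail from outlook and gmail every morning',
    'please contact support or the sales team for more information'
].join('\n');

// Count every substring of length n in the text.
function countNgrams(text, n) {
    const counts = new Map();
    for (let i = 0; i + n <= text.length; i++) {
        const gram = text.slice(i, i + n);
        counts.set(gram, (counts.get(gram) || 0) + 1);
    }
    return counts;
}

// Build the unigram, bigram and trigram tables once.
function trainModel(text) {
    return {
        total: text.length,
        unigrams: countNgrams(text, 1),
        bigrams: countNgrams(text, 2),
        trigrams: countNgrams(text, 3),
        weights: [0.001, 0.01, 0.989] // same weights as the Python code
    };
}

// Average negative log-likelihood per character; higher means stranger.
function strangeness(model, input) {
    const get = (map, key) => map.get(key) || 0;
    const [w0, w1, w2] = model.weights;
    const text = '  ' + input.toLowerCase() + '\n';
    let r = 0;
    for (let i = 2; i < text.length; i++) {
        const char = text[i];
        const context1 = text.slice(i - 1, i);
        const context2 = text.slice(i - 2, i);
        // +1 on the unigram term keeps num above zero for unseen characters
        const num = (get(model.unigrams, char) + 1) * w0
            + get(model.bigrams, context1 + char) * w1
            + get(model.trigrams, context2 + char) * w2;
        // each term is divided by the count of its preceding context
        const den = (model.total + 1) * w0
            + get(model.unigrams, context1) * w1
            + get(model.bigrams, context2) * w2;
        r -= Math.log(num / den);
    }
    return r / (text.length - 2);
}

const model = trainModel(SAMPLE_TEXT);
for (const s of ['gmail', 'outlook', '14370afcdc17429f9e418d5ffbd0334a']) {
    console.log(s, '->', strangeness(model, s).toFixed(4));
}
```

The absolute scores depend entirely on the training text, so a threshold would have to be chosen after training on a realistic corpus of names and words.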

So in the end I will loop over all the strings and get something like this:

"14370afcdc17429f9e418d5ffbd0334a"                  ->    8.9073
"gmail"                                             ->    1.0044
"ce06e817-2149-6cfd-dd24-51b31e93ea1a"              ->    7.4261
"87c0d782-e09f-056f-f544-c6ec9d17943c"              ->    8.3916
"space-max"                                         ->    1.3553
"ns3160176.ip-151-106-35"                           ->    6.2584
"ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h"               ->    7.1796
"outlook"                                           ->    1.6694
"h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312"                ->    8.5734
"system-analytics"                                  ->    1.9489
"454-fs-ns-dff4-xhh-43d-frfs"                       ->    7.7058

Does anybody have a hint on how to implement this in JavaScript?
Thanks a lot :)

UPDATE (12-22-2020)

I managed to write some code based on @Konstantin Pribluda's answer, using the Shannon entropy calculation:

const getFrequencies = str => {
    let dict = new Set(str);
    return [...dict].map(chr => {
        return str.match(new RegExp(chr, 'g')).length;
    });
};

// Measure the entropy of a string in bits per symbol.
const entropy = str => getFrequencies(str)
    .reduce((sum, frequency) => {
        let p = frequency / str.length;
        return sum - (p * Math.log(p) / Math.log(2));
    }, 0);

const strings = ['14370afcdc17429f9e418d5ffbd0334a', 'or', 'sdf', 'test', 'dave coperfield', 'gmail', 'ce06e817-2149-6cfd-dd24-51b31e93ea1a',
    '87c0d782-e09f-056f-f544-c6ec9d17943c', 'space-max', 'ns3160176.ip-151-106-35',
    'ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h', 'outlook', 'h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312', 'system-analytics', '454-fs-ns-dff4-xhh-43d-frfs'];

for (let i = 0; i < strings.length; i++) {
    const str = strings[i];
    let result = 0;
    try {
        result = entropy(str);
    }
    catch (error) { result = 0; }
    console.log(`Entropy of '${str}' in bits per symbol:`, result);
}

The output is:

Entropy of '14370afcdc17429f9e418d5ffbd0334a' in bits per symbol: 3.7417292966721747
Entropy of 'or' in bits per symbol: 1
Entropy of 'sdf' in bits per symbol: 1.584962500721156
Entropy of 'test' in bits per symbol: 1.5
Entropy of 'dave coperfield' in bits per symbol: 3.4565647621309536
Entropy of 'gmail' in bits per symbol: 2.3219280948873626
Entropy of 'ce06e817-2149-6cfd-dd24-51b31e93ea1a' in bits per symbol: 3.882021446536749
Entropy of '87c0d782-e09f-056f-f544-c6ec9d17943c' in bits per symbol: 3.787301737252941
Entropy of 'space-max' in bits per symbol: 2.94770277922009
Entropy of 'ns3160176.ip-151-106-35' in bits per symbol: 3.1477803284561103
Entropy of 'ds4-f1g-54-h5-dfg-yk-4gd-htr5-fdg5h' in bits per symbol: 3.3502926596166693
Entropy of 'outlook' in bits per symbol: 2.1280852788913944
Entropy of 'h-rt-dfg4-sv6-fg32-dsv5-vfd5-ds312' in bits per symbol: 3.619340871812292
Entropy of 'system-analytics' in bits per symbol: 3.327819531114783
Entropy of '454-fs-ns-dff4-xhh-43d-frfs' in bits per symbol: 3.1299133176846836

It's still not working as expected: 'dave coperfield' scores about the same as the gibberish strings.
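One likely reason is that Shannon entropy depends only on how often each character occurs, not on the order of the characters, so a longer natural string with many distinct letters scores as high as a random one. The sketch below illustrates this; `shannonEntropy` is a hypothetical name. It also counts characters with a Map instead of building a RegExp per character, because a character such as '.' is a regex metacharacter and would match every character in the string.

```javascript
// Shannon entropy in bits per symbol, counting characters with a Map.
function shannonEntropy(str) {
    const counts = new Map();
    let length = 0;
    for (const chr of str) {
        counts.set(chr, (counts.get(chr) || 0) + 1);
        length++;
    }
    let sum = 0;
    for (const count of counts.values()) {
        const p = count / length;
        sum -= p * Math.log2(p);
    }
    return sum;
}

// Sorting the characters destroys the name but keeps the character counts,
// so both strings score the same (up to floating-point rounding).
const sample = 'dave coperfield';
const scrambled = [...sample].sort().join('');
console.log(shannonEntropy(sample), shannonEntropy(scrambled));
```

Any measure that should separate names from gibberish therefore needs to look at character order, for example through bigram or trigram statistics.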

Does anyone have better logic or ideas on how to do it?

Or Assayag
  • What if `14370afcdc17429f9e418d5ffbd0334a@domain.com` is a valid email? – evolutionxbox Dec 21 '20 at 13:30
  • @evolutionxbox I know it can be valid, but the chances that this email address is real are low, and I'm willing to take that chance. – Or Assayag Dec 21 '20 at 13:31
  • What exactly is your question? – old greg Dec 21 '20 at 13:33
  • `reallymyemail@gmail.com` could be fake too – Pointy Dec 21 '20 at 13:37
  • @Pointy I know, but I'm looking to solve specific cases of gibberish email addresses right now. – Or Assayag Dec 21 '20 at 13:38
  • I for one have such an email address, but I get what you want. This seems like something you could train/use an AI for. I don't know if coding it manually would make the cut, since there are always going to be strange exceptions. – 3limin4t0r Dec 21 '20 at 13:41
  • Btw, this could conflict with “sign in with apple”. – evolutionxbox Dec 21 '20 at 13:50
  • @evolutionxbox I'm aware, I'm not going for third-party auth. – Or Assayag Dec 21 '20 at 13:51
  • @OrAssayag the [tag:gibberish] tag was previously deleted by the community's decision, see https://meta.stackoverflow.com/questions/344165/this-tag-is-literally-gibberish. I think your tag edits should be reverted. Is that ok? – Vadim Kotov Dec 25 '20 at 09:50
  • Please do not create meta tags. “Gibberish” was [burninated before](https://meta.stackoverflow.com/questions/344165/this-tag-is-literally-gibberish), I’ll be removing it again. – Martijn Pieters Dec 25 '20 at 10:49
  • If you just want to check whether an email exists, you should use an API. Send a request with XMLHttpRequest() and get the results. Checking if an email is gibberish is a bad idea (my email is only consonants) – Rocket Nikita Dec 25 '20 at 11:09
  • ... also check out [gibberish-detector.js](https://github.com/gtomitsuka/gibberish-detector.js/) – Rocket Nikita Dec 25 '20 at 11:24
  • @RocketNikita I checked the gibberish detector, it's not good enough. Thanks. – Or Assayag Dec 25 '20 at 11:58
  • @OrAssayag Hello, have you considered using an e-mail verification API like hunter or mailboxlayer? Maybe this would be more stable. – oguzhancerit Dec 25 '20 at 12:37
  • @oguzhancerit I want to detect these gibberish email addresses before I send them to any third-party service for verification, to save a call. – Or Assayag Dec 26 '20 at 08:31

2 Answers

4

This is what I came up with:

// gibberish detector js
(function (h) {
    function e(c, b, a) { return c < b ? (a = b - c, Math.log(b) / Math.log(a) * 100) : c > a ? (b = c - a, Math.log(100 - a) / Math.log(b) * 100) : 0 } function k(c) { for (var b = {}, a = "", d = 0; d < c.length; ++d)c[d] in b || (b[c[d]] = 1, a += c[d]); return a } h.detect = function (c) {
        if (0 === c.length || !c.trim()) return 0; for (var b = c, a = []; a.length < b.length / 35;)a.push(b.substring(0, 35)), b = b.substring(36); 1 <= a.length && 10 > a[a.length - 1].length && (a[a.length - 2] += a[a.length - 1], a.pop()); for (var b = [], d = 0; d < a.length; d++)b.push(k(a[d]).length); a = 100 * b; for (d = b =
            0; d < a.length; d++)b += parseFloat(a[d], 10); a = b / a.length; for (var f = d = b = 0; f < c.length; f++) { var g = c.charAt(f); g.match(/^[a-zA-Z]+$/) && (g.match(/^(a|e|i|o|u)$/i) && b++, d++) } b = 0 !== d ? b / d * 100 : 0; c = c.split(/[\W_]/).length / c.length * 100; a = Math.max(1, e(a, 45, 50)); b = Math.max(1, e(b, 35, 45)); c = Math.max(1, e(c, 15, 20)); return Math.max(1, (Math.log10(a) + Math.log10(b) + Math.log10(c)) / 6 * 100)
    }
})("undefined" === typeof exports ? this.gibberish = {} : exports)

// email syntax validator
function validateSyntax(email) {
    return /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/.test(email.toLowerCase());
}

// shannon entropy
function entropy(str) {
    return Object.values(Array.from(str).reduce((freq, c) => (freq[c] = (freq[c] || 0) + 1) && freq, {})).reduce((sum, f) => sum - f / str.length * Math.log2(f / str.length), 0)
}

// vowel counter
function countVowels(word) {
    var m = word.match(/[aeiou]/gi);
    return m === null ? 0 : m.length;
}

// dummy function
function isTrue(value){
    return value
}

// validate string by multiple tests; returns true when at least two of the
// three checks pass, i.e. the string looks plausible
function detectGibberish(str){
    var strWithoutPunct = str.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
    // percentage of vowels among the non-punctuation characters
    var vowelPercent = strWithoutPunct.length ? 100 / strWithoutPunct.length * countVowels(strWithoutPunct) : 0;

    var entropyValue = entropy(str) < 3.5;
    var gibberishValue = gibberish.detect(str) < 50;
    var vowelValue = 30 < vowelPercent && vowelPercent < 35;
    return [entropyValue, gibberishValue, vowelValue].filter(isTrue).length > 1
}

// main function
function validateEmail(email) {
    return validateSyntax(email) ? detectGibberish(email.split("@")[0]) : false
}

// tests
document.write(validateEmail("dsfghjdhjs@gmail.com") + "<br/>")
document.write(validateEmail("jhon.smith@gmail.com"))

I have combined multiple tests: gibberish-detector.js, Shannon entropy and a vowel count (the share of vowels must be between 30% and 35%). You can adjust the thresholds for more accurate results.
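To adjust the vowel range, it may help to print the raw percentage for a few known-good and known-bad strings first. The helper below is a hypothetical sketch, not part of the code above: unlike that code, it ignores digits as well as punctuation, which is an assumption on my part.

```javascript
// Hypothetical helper: percentage of vowels among the letters of a string.
// Assumption: digits and punctuation are both ignored.
function vowelPercent(str) {
    const letters = str.replace(/[^a-z]/gi, '');
    if (letters.length === 0) return 0;
    const vowels = (letters.match(/[aeiou]/gi) || []).length;
    return 100 * vowels / letters.length;
}

// Print the raw value for a few samples before choosing a range.
const samples = ['gmail', 'outlook', 'jhon.smith', 'dsfghjdhjs'];
for (const s of samples) {
    console.log(s, vowelPercent(s).toFixed(1) + '%');
}
```

Looking at the numbers for real local parts should show whether a narrow range such as 30% to 35% rejects too many legitimate strings.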

Rocket Nikita
  • Thanks for your time and effort writing this function for me. Although it's not working 100% as I expected, it's good enough for my needs. – Or Assayag Dec 29 '20 at 08:36
2

One thing you may consider is scoring how random each string is, sorting the results by score, and excluding the ones whose randomness is above a given threshold. It is inevitable that you will miss some.
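A minimal sketch of that score, sort and threshold flow is shown below. The scoring function `nonLetterRatio` is a deliberately crude, hypothetical stand-in (the share of characters that are not letters), and the threshold of 0.3 is an arbitrary assumption; any randomness measure can be passed in instead.

```javascript
// Hypothetical stand-in score: share of characters that are not letters.
function nonLetterRatio(str) {
    if (str.length === 0) return 0;
    const nonLetters = str.replace(/[a-z]/gi, '').length;
    return nonLetters / str.length;
}

// Score every string, sort from most to least random, split at the threshold.
function splitByThreshold(strings, scoreFn, threshold) {
    const scored = strings
        .map(value => ({ value, score: scoreFn(value) }))
        .sort((a, b) => b.score - a.score);
    return {
        suspicious: scored.filter(item => item.score > threshold),
        kept: scored.filter(item => item.score <= threshold)
    };
}

const result = splitByThreshold(
    ['gmail', 'space-max', '14370afcdc17429f9e418d5ffbd0334a', 'system-analytics'],
    nonLetterRatio,
    0.3 // arbitrary threshold, to be tuned on real data
);
console.log(result.suspicious.map(item => item.value));
console.log(result.kept.map(item => item.value));
```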

There are some implementations for checking the randomness of strings, for example:

You may have to create a hash (to map chars and symbols to sequences of integers) before you apply some of these because some work only with integers, since they test properties of random numbers generators.
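As a sketch of that mapping step, either of the hypothetical helpers below turns a string into a sequence of integers: the first uses Unicode code points, the second assigns a small index to each distinct character in order of first appearance. Which one suits a given randomness test is an assumption to check against that test's requirements.

```javascript
// Map each character to its Unicode code point.
function toCodePoints(str) {
    return Array.from(str, ch => ch.codePointAt(0));
}

// Map each distinct character to a small index, in order of first appearance.
function toDenseIndices(str) {
    const index = new Map();
    return Array.from(str, ch => {
        if (!index.has(ch)) index.set(ch, index.size);
        return index.get(ch);
    });
}

console.log(toCodePoints('ds4-f1g'));
console.log(toDenseIndices('ds4-f1g'));
```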

A Stack Exchange link that may also help is this:

PS. I have a similar problem in a service where bots create accounts with this type of fake email. After years of dealing with the issue (basically deleting the fake emails from the DB by hand), I am now considering adding a visual check (captcha) to the signup page to avoid the frustration.

pebox11
  • Thanks for this, friend. I'm doing this for a private project, not for a sign-up flow or anything similar. I'm considering writing a new NPM package for this. Your info links could be useful. Thanks again. – Or Assayag Dec 28 '20 at 20:22