Learning names of spammers

Question

Currently, waves of spam are flooding the internet, especially around sporting events.

Since I strongly suspect that the spammers' usernames are computer generated, I thought it might be interesting to try learning spammer names programmatically.

A user name should be between 2 and 15 characters, begin with a letter and contain only letters, numbers, _ or -.
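The rule above can be written as a single regular expression. This is only a minimal sketch: it assumes "letters" means ASCII letters, and the names `USERNAME_RE` and `is_valid_username` are made up for illustration.

```python
import re

# One leading letter, then 1 to 14 more characters out of letters,
# digits, "_" or "-", which gives a total length of 2 to 15.
# Assumption: "letters" means ASCII letters only.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{1,14}")


def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the username rule."""
    return USERNAME_RE.fullmatch(name) is not None
```

Names that fail this check can be dropped before any learning step, since they cannot be real usernames anyway.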

A sample list of names would be

riazsports0171
maya34444
thelmaeatons
tigran777
newlive100
darbeshbaba
litondina10
nithuhasan
newlive100
bankuali
lldztwydni554
monomala505
nasiruddin1500
lldztwydni554
ariful3032
nazmulhasan

I have only a fairly basic knowledge of algorithms (from university). My question is which machine learning algorithms and/or string metrics I could use to predict whether an arbitrary username probably belongs to a spammer. I thought about using cosine string similarity, because it's fairly simple.

You would need to have a dataset of names where each name is labelled as belonging to a spammer or not. — Sicco, Sep 03 '12 at 14:19
What other information do you have? The content, or only these names? I agree that string similarity is the easiest and probably the best option. — Areza, Sep 03 '12 at 14:26
Scraping the contents would be very tedious; I would prefer to deal only with names. I have far more of these spammer usernames, I just wanted to provide an example of what they look like. Btw I'm not a spammer :) — user3001, Sep 03 '12 at 15:15

amit · Accepted Answer · 2012-09-03T14:56:54.480

Interesting. But I don't think string similarity algorithms are the best solution.

I'd try to extract features from the names and use a classification algorithm. SVM usually provides very good results compared to other classification algorithms, but there are others as well (for example Naive Bayes, decision trees, KNN), each with its advantages and disadvantages.

The tricky part will be extracting the features. You should be creative. Some options are: number of digits, number of consecutive letters, number of consecutive consonants, use of capitalization, correct use of capitalization, whether the name matches a certain regex, ... (You could also use features that don't come from the string, such as the number of messages this user has sent you, ....)
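To make a few of these options concrete, here is a rough sketch of a feature extractor. The function names and the exact choice of features are my own assumptions, not a fixed recipe; "y" is treated as a consonant purely for simplicity.

```python
import re

VOWELS = set("aeiou")  # assumption: everything else alphabetic counts as a consonant


def longest_run(name, predicate):
    """Length of the longest stretch of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in name:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def is_consonant(ch):
    return ch.isalpha() and ch.lower() not in VOWELS


def extract_features(name):
    """Map a username to a dict of simple numeric features (hypothetical selection)."""
    return {
        "length": len(name),
        "digit_count": sum(ch.isdigit() for ch in name),
        "longest_letter_run": longest_run(name, str.isalpha),
        "longest_consonant_run": longest_run(name, is_consonant),
        "has_uppercase": int(any(ch.isupper() for ch in name)),
        "ends_with_digits": int(re.search(r"\d+$", name) is not None),
    }
```

Calling `extract_features("lldztwydni554")` on a name from the question shows why the consonant-run feature might be useful: pronounceable names rarely have long runs without a vowel.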

Next, you need to create a training set. It will contain the usernames of both spammers and non-spammers, each manually labeled as spammer or non-spammer.

Feed the training set to your algorithm of choice, and it will create a classifier, which you will be able to use to predict if new users are spammers or not.
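As an illustration of the train-then-predict flow, here is a plain-Python k-nearest-neighbours sketch. KNN is used only because it fits in a few lines, not because it is the best choice here. The spammer names are taken from the question; the non-spammer names and the two toy features are hypothetical placeholders that you would replace with real labeled data and better features.

```python
import math
from collections import Counter


def features(name):
    """Two toy features (an assumption for this sketch): digit count and length."""
    return (sum(ch.isdigit() for ch in name), len(name))


def train(labelled_names):
    """'Training' for KNN just stores the feature vector and label of each example."""
    return [(features(name), label) for name, label in labelled_names]


def predict(model, name, k=3):
    """Label a new name by majority vote among its k nearest training examples."""
    query = features(name)
    nearest = sorted(model, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


TRAINING_SET = [
    # Spammer names from the question.
    ("riazsports0171", "spam"),
    ("maya34444", "spam"),
    ("tigran777", "spam"),
    ("newlive100", "spam"),
    # Hypothetical non-spammer names, invented for illustration only.
    ("alice", "ham"),
    ("bob_smith", "ham"),
    ("carol-j", "ham"),
    ("dave", "ham"),
]

model = train(TRAINING_SET)
print(predict(model, "monomala505"))
```

With features this crude, spammer names without digits (for example nazmulhasan) will be misclassified, which is exactly why the feature extraction step deserves most of the effort.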

You can evaluate the effectiveness of each algorithm by using cross-validation on your data.
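A k-fold cross-validation loop can be sketched in plain Python as follows. The helper `k_fold_accuracy` and the majority-class baseline are hypothetical names for this example; any pair of `train(examples)` and `predict(model, name)` functions with those shapes could be plugged in.

```python
import random
from collections import Counter


def k_fold_accuracy(examples, train, predict, k=5, seed=0):
    """Average accuracy over k folds; every example is held out exactly once."""
    if len(examples) < k:
        raise ValueError("need at least k labeled examples")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, name) == label for name, label in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Baseline to compare against: always predict the most common training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, name):
    return model
```

Running `k_fold_accuracy(labelled_names, train_majority, predict_majority)` gives a floor that any real classifier should beat; comparing the scores of different algorithms on the same folds then tells you which one to keep.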

That should give me some good points to start from, thank you very much! — user3001, Sep 03 '12 at 15:16
