
I would like to create an algorithm to identify people who write on a forum under different nicknames.

The goal is to discover people who register a new account to flame the forum anonymously instead of under their main account.

Basically, I was thinking about stemming the words they use and comparing users according to the similarity of these words.

[Figure: users linked to the words they use]

As shown in the picture, user3 and user4 use the same words. That suggests there is probably one person behind both accounts.

It's clear that there are a lot of common words used by all users, so I should focus on "user-specific" words.

Input is (related to the image above):

<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.

Output should be:

user1
user2
user3 = user4

I am doing this in Java, but I want this question to be language-independent.

Any ideas how to do it?

1) How should I store the words/users? What data structures?

2) How do I get rid of the common words everybody uses? I have to somehow ignore them among the user-specific words. Maybe I could just ignore them because they get lost in the mass, but I am afraid they would hide the significant differences in the "user-specific words".

3) How do I recognize the same users? Should I somehow count the shared words between each pair of users?

Thanks in advance for any advice.

— Martin Nuc

3 Answers

In general this is a task of author identification, and there are several good papers like this one that may give you a lot of information. Here are my own suggestions on this topic.

1. User recognition/author identification itself

The simplest kind of text classification is classification by topic, and there you take meaningful words first of all. That is, if you want to distinguish text about Apple the company from text about apple the fruit, you count words like "eat", "oranges", "iPhone", etc., but you commonly ignore things like articles, word forms, part-of-speech (POS) information and so on. However, many people may talk about the same topics but use different styles of speech, that is, articles, word forms and all the things you ignore when classifying by topic. So the first and main thing you should consider is collecting the most useful features for your algorithm. An author's style may be expressed by the frequency of words like "a" and "the", POS information (e.g. some people tend to use the present tense, others the future), common phrases ("I would like" vs. "I'd like" vs. "I want") and so on.

Note that topic words should not be discarded completely - they still show the themes the user is interested in. However, you should treat them somewhat specially; e.g., you can pre-classify texts by topic and then discriminate between users who are not interested in it.

When you are done with feature collection, you may use one of the machine learning algorithms to find the best guess for the author of a text. As for me, the two best suggestions here are probability and cosine similarity between the text vector and the user's averaged vector.
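
For illustration, here is a minimal Java sketch of the cosine similarity part, assuming both the text and the user are already represented as feature vectors of equal length (the representation itself is my assumption):

class CosineSimilarity {
    /** Cosine similarity between two feature vectors of equal length. */
    static double of(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // an all-zero vector has no direction
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

A value close to 1 means the text's feature distribution closely matches the user's usual one.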

2. Discriminating common words

Or, in the context above, common features. The best way I can think of to get rid of the words that are used by all people more or less equally is to compute the entropy of each such feature:

entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))

where x is a feature, Ui is the i-th user, P(Ui|x) is the conditional probability of the i-th user given feature x, and the sum runs over all users.

A high entropy value indicates that the distribution of the feature across users is close to uniform, and thus the feature is almost useless for telling users apart.
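
As a rough sketch, the entropy could be computed like this in Java, assuming you keep a per-user usage count for each feature (the map layout is my own assumption):

import java.util.Map;

class FeatureEntropy {
    /** counts maps userId -> number of times that user used the feature x. */
    static double entropy(Map<Integer, Integer> counts) {
        double total = 0;
        for (int c : counts.values()) total += c;
        double h = 0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = c / total; // estimate of P(Ui|x)
            h -= p * Math.log(p);
        }
        return h;
    }
}

Since the maximum possible entropy is log(numberOfUsers), features whose entropy comes close to that bound are good candidates to drop.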

3. Data representation

A common approach here is to use a user-feature matrix. That is, you just build a table where rows are user ids and columns are features. E.g., cell [3][12] shows how many times user #3 used feature #12 (don't forget to normalize these frequencies by the total number of features the user ever used!).

Depending on the features you are going to use and the size of the matrix, you may want to use a sparse matrix implementation instead of a dense one. E.g., if you use 1000 features and for every particular user around 90% of the cells are 0, it doesn't make sense to keep all those zeros in memory, and a sparse implementation is the better option.
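
One simple sparse representation in Java is a map of maps that stores only the non-zero cells (the class and method names are invented for this sketch):

import java.util.HashMap;
import java.util.Map;

class UserFeatureMatrix {
    // userId -> (featureId -> normalized frequency); missing cells mean 0
    private final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();

    void set(int userId, int featureId, double value) {
        rows.computeIfAbsent(userId, k -> new HashMap<>()).put(featureId, value);
    }

    double get(int userId, int featureId) {
        Map<Integer, Double> row = rows.get(userId);
        return row == null ? 0.0 : row.getOrDefault(featureId, 0.0);
    }
}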

— ffriend

I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, ...) on each of your user accounts' words. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.

Such a mapping can be considered a high-dimensional vector, in the same way documents are considered vectors in the vector space model of information retrieval. You can then compare these vectors using KL divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity/small distance between two accounts' vectors might then indicate that they belong to one and the same user.
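
As a sketch of this approach in Java, here is an add-one-smoothed unigram model together with KL divergence; the smoothing choice and all names are mine, not prescribed above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UnigramLm {
    /** Unigram model with add-one smoothing over a shared vocabulary. */
    static Map<String, Double> train(List<String> words, Set<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        double total = words.size() + vocabulary.size(); // one pseudo-count per vocabulary word
        Map<String, Double> model = new HashMap<>();
        for (String w : vocabulary)
            model.put(w, (counts.getOrDefault(w, 0) + 1) / total);
        return model;
    }

    /** KL divergence D(p || q); both models must be built over the same vocabulary. */
    static double klDivergence(Map<String, Double> p, Map<String, Double> q) {
        double d = 0;
        for (Map.Entry<String, Double> e : p.entrySet())
            d += e.getValue() * Math.log(e.getValue() / q.get(e.getKey()));
        return d;
    }
}

Note that KL divergence is asymmetric; if you want a symmetric measure, use D(p||q) + D(q||p) or fall back to cosine distance.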

— Fred Foo

How should I store the words/users? What data structures?

You probably have some kind of representation for the users and the posts that they have made. I think you should keep a list of words and, for each word, a list of the users who use it. Something like:

<word: <user#1, user#4, user#5, ...> >
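
In Java, this can be an inverted index from each word to the set of users who have used it (a minimal sketch; identifiers are made up):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WordIndex {
    // word -> set of users who have used that word
    private final Map<String, Set<String>> wordToUsers = new HashMap<>();

    void add(String word, String user) {
        wordToUsers.computeIfAbsent(word, w -> new HashSet<>()).add(user);
    }

    Set<String> usersOf(String word) {
        return wordToUsers.getOrDefault(word, Set.of());
    }
}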

How do I get rid of the common words everybody uses?

Hopefully, you have a set of stopwords. Why not extend it to include commonly used words from your forum? For example, for Stack Overflow, the names of the most frequently used tags should qualify.
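
One way to extend the stopword set automatically, sketched on top of a word-to-users index like the one above (the threshold is arbitrary):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Stopwords {
    /** Adds every word used by more than maxFraction of all users to the stopword set. */
    static Set<String> extend(Set<String> stopwords,
                              Map<String, Set<String>> wordToUsers,
                              int totalUsers, double maxFraction) {
        Set<String> extended = new HashSet<>(stopwords);
        for (Map.Entry<String, Set<String>> e : wordToUsers.entrySet())
            if (e.getValue().size() > maxFraction * totalUsers)
                extended.add(e.getKey()); // too widespread to be user-specific
        return extended;
    }
}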

How do I recognize the same users?

In addition to similarity or word-frequency based measures, you can also try using the interactions between users. For example, user3 likes/upvotes/comments on each and every post by user8, or a new user does the same for some other (older) user.
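
If you have an interaction log, counting interactions per ordered pair of users is straightforward; here is a toy Java sketch (the Interaction record and its fields are hypothetical):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Interactions {
    record Interaction(String actor, String targetAuthor) {}

    /** Counts how often each user liked/upvoted/commented on each other user's posts. */
    static Map<String, Map<String, Integer>> countPairs(List<Interaction> log) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Interaction i : log)
            counts.computeIfAbsent(i.actor(), a -> new HashMap<>())
                  .merge(i.targetAuthor(), 1, Integer::sum);
        return counts;
    }
}

Suspiciously high counts for one pair of accounts can then be combined with the textual evidence.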

— KK.