
I'm searching for a simple algorithm or an open-source PHP library to estimate whether a text is mainly written in a specific language. I found the following answer relating to Python, which probably points in the right direction, but something working out of the box for PHP would be a charm.

Of course something like an n-gram estimator wouldn't be too hard to implement, but it requires a reference database as well.

The actual problem to solve is as follows. I run a WordPress blog, which is currently flooded with spam. The blog is in German, and virtually all trackback spam is in English. My idea is to immediately mark as spam all trackbacks that appear to be English. However, I cannot use marker words, because I do not want to spam a comment just for a typo or an English citation.

My solution:

Using the answers to this question, I implemented a solution that detects German by a simple stopword ratio. Any comment containing a link must consist of at least 25% German stopwords. So you can still comment something like "cool article", which has no stopwords at all, but if you include a link, you should bother to write proper German.

Unfortunately, the German stopwords from NLTK are incorrect: the list contains words that do not exist in German. So I used the Snowball list instead. Using the Perl regexp optimizer, I condensed the entire list into a single regexp and count the stopwords with preg_match_all(). The whole filter is 25 lines, plus about a third of that in Perl code to produce the regexp from the list. Let's see how it performs in the wild.
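For illustration, here is a minimal sketch of that filter in PHP. The stopword alternation is abbreviated to a handful of words; the real filter uses a single optimized regexp generated from the full Snowball German list. Function names and the link test are illustrative, not the exact code.

```php
<?php
// Minimal sketch of the stopword-ratio filter. The alternation is
// abbreviated; the real filter uses one optimized regexp built from
// the full Snowball German stopword list.
function looks_german($text, $threshold = 0.25)
{
    $stopwords = '/\b(und|oder|aber|der|die|das|dem|den|ein|eine|ich|'
               . 'nicht|mit|von|auf|ist|sind|dass|auch|wie|bei|nach)\b/ui';

    // Split into whitespace-separated tokens.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return false;
    }

    // Count stopword hits and compare the ratio against the threshold.
    $hits = preg_match_all($stopwords, $text, $matches);
    return ($hits / count($words)) >= $threshold;
}

// Only comments containing a link need to pass the stopword check.
function is_probably_spam($comment)
{
    $hasLink = preg_match('#https?://#i', $comment) === 1;
    return $hasLink && !looks_german($comment);
}
```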

Thanks for your help.

Lars Hanke
  • Why don't you use Akismet? – jraede Jun 13 '13 at 19:21
  • You can get lots of data by downloading material from Project Gutenberg for various languages. However, what you're searching for is a spam classifier; depending on how much spam you have gathered through your blog, this can be a pretty easy task. Maybe you want to update your question with more of this information. – Thomas Jungblut Jun 13 '13 at 19:21
  • @jraede Akismet has legal issues. I do not want to pass legitimate comments through any foreign server. Otherwise I'd have to put a privacy statement into the comment form, which may scare off real commenters. – Lars Hanke Jun 13 '13 at 20:28
  • @ThomasJungblut The current spam/ham ratio is about 35! I fear that Bayes solutions might tend toward false positives. Nor do I want to spam any comment that merely contains one of the typical drug names; it is quite possible that real commenters use these words. And I want to spam the comments immediately. I currently hold them in moderation, but kicking 60 comments per day is a boring task. – Lars Hanke Jun 13 '13 at 20:45
  • Related question: [how-to-compute-letter-frequency-similarity](http://stackoverflow.com/q/15710292/1988505) – Wesley Baugh Jun 14 '13 at 00:00

2 Answers


I agree with @Thomas that what you are looking for is a spam classifier rather than a language-detection algorithm. Nonetheless, I think this language-detection solution is simple enough and works out of the box, as you want. Basically, if you count the number of stopwords from different languages and select the language with the highest count in the document, you have a simple yet very effective language classifier.

Now, the best part is that you barely need to code anything, as you can use standard stopword lists and processing packages like nltk to handle the information. Here you have an example of how to implement it from scratch with Python and nltk.
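Since the question targets PHP, a hypothetical sketch of the same idea there might look as follows. The word lists are truncated to a few entries each; in practice you would load full stopword lists (e.g. the Snowball ones) for every language you care about.

```php
<?php
// Hypothetical sketch of the stopword-counting classifier: count
// how many tokens appear in each language's stopword list and pick
// the language with the highest count. Word lists are abbreviated.
function detect_language($text)
{
    $stopwords = array(
        'english' => array('the', 'and', 'is', 'of', 'to', 'in', 'that', 'it'),
        'german'  => array('und', 'der', 'die', 'das', 'ist', 'nicht', 'ein', 'mit'),
    );

    // Lowercase and split into letter-only tokens.
    $tokens = preg_split('/\P{L}+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $best = 'unknown';
    $bestCount = 0;
    foreach ($stopwords as $lang => $words) {
        // Count tokens that occur in this language's stopword list.
        $count = count(array_intersect($tokens, $words));
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $lang;
        }
    }
    return $best;
}
```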

I hope this helps.

miguelmalvarez
  • A nice link. Shouldn't be too hard to extract the stopword lists from Python into PHP and implement the algorithm in PHP. If there is no simpler solution by the weekend, I'll give it a try. – Lars Hanke Jun 13 '13 at 20:51
  • You should be able to call the Python implementation from PHP and parse the result. It might be quicker than re-implementing it. – miguelmalvarez Jun 13 '13 at 21:12
  • Might be an idea. My hoster officially does not offer Python, but I can run Python from a shell. It'll depend on how many modules I'll need to install locally. – Lars Hanke Jun 13 '13 at 21:30
  • Mmm, if you really do not want to use anything but PHP and want to focus only on detecting English (which, as I understand, is your main goal), my suggestion is to download a list of stopwords from [here](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) and to count such words via PHP. Then, if a certain number of these words appears in the document, you classify it as English. – miguelmalvarez Jun 14 '13 at 07:33

If all you want to do is recognize English, there's a very easy hack. If you just check the letters in a post, English is one of the only languages that will fall entirely within the pure-ASCII range. It's hacky, but I believe it's a decent simplification of an otherwise very difficult problem.
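For concreteness, a rough PHP sketch of this check (the function name is illustrative):

```php
<?php
// Rough sketch of the pure-ASCII heuristic: if a comment contains
// no bytes outside the 7-bit ASCII range, guess that it is English.
// Note the caveat in the comment below: short German comments can
// also be pure ASCII, so on its own this is only a weak signal.
function is_pure_ascii($text)
{
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}
```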

My guess on its efficacy, from some quick back-of-the-envelope calculations on a couple of French and German blogs, would be ~85%. That isn't foolproof, but I would think it's pretty good for the simplicity of it.

Slater Victoroff
  • Many comments tend to be short, and I have valid German comments that are pure ASCII. Nothing I'd want to spam unseen. – Lars Hanke Jun 13 '13 at 20:48