Brute-Force language detection

Question

I need an algorithm (in any programming language) to serve as the fitness function of a hill-climbing algorithm that breaks a cipher for a crypto challenge. It should score how likely it is that a candidate decryption (which contains no spaces) is English text rather than a random sequence of characters, and it should also give points for words that are still incomplete.

I tried several scoring algorithms of my own, but none of them worked well.

My research:

An Enigma M4 crypto project ( http://www.bytereef.org/m4_project.html ) uses the Sinkov statistics, which I would like to use as well.

The only thing I found was a document about «quebra-pedra», a Java framework that includes the Sinkov log-weight analysis I am looking for.

http://www.google.com/m?client=ms-android-samsung&source=android-home#q=Quebra-pedra+framework+java

But I could not find where to download the framework, nor any implementation or description of the Sinkov test.

I would be glad for any hints. Thanks.

score 6 · Answer 1 · answered Oct 17 '11 at 23:59

I don't know about Sinkov statistics, but language models from natural language processing can do exactly what you want, scoring text by how similar it is to English.

I wrote a simple character-bigram model here; it should be reasonably easy to follow.

https://github.com/rrenaud/Gibberish-Detector
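To make the idea concrete, here is a minimal sketch of a character-bigram scorer in Python. It is not the code from the repository above, just an illustration under some assumptions: the names `train`, `score` and `SAMPLE_CORPUS` are made up for this example, add-one smoothing is one possible choice among several, and the tiny built-in corpus is only a placeholder (a real scorer would be trained on a large English text).

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def normalize(text):
    """Lowercase and keep only a-z; spaces are dropped, as in the ciphertext."""
    return [c for c in text.lower() if c in ALPHABET]


def train(corpus):
    """Return log P(next letter | previous letter) for all 26 x 26 pairs.

    Every pair starts with a count of 1 (add-one smoothing), so pairs that
    never occur in the corpus get a small but non-zero probability.
    """
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    letters = normalize(corpus)
    for a, b in zip(letters, letters[1:]):
        counts[a][b] += 1
    log_prob = {}
    for a, row in counts.items():
        total = sum(row.values())
        for b, n in row.items():
            log_prob[a + b] = math.log(n / total)
    return log_prob


def score(text, log_prob):
    """Average log-probability per bigram; closer to 0 means more English-like."""
    letters = normalize(text)
    if len(letters) < 2:
        return float("-inf")
    total = sum(log_prob[a + b] for a, b in zip(letters, letters[1:]))
    return total / (len(letters) - 1)


# Placeholder corpus (assumption): far too small for real use.
SAMPLE_CORPUS = (
    "the quick brown fox jumps over the lazy dog and then the dog chases "
    "the fox through the forest while the sun is setting over the hills "
    "this is a short sample of english text used only for the illustration"
)

model = train(SAMPLE_CORPUS)
english = score("thesunissettingoverthehills", model)
garbage = score("xqzvkjwpqzxvbkqjzwxq", model)
print(english > garbage)
```

Because the score is built from letter pairs rather than dictionary words, a partially correct decryption that contains only fragments of words still scores higher than pure noise, which is what a hill climber needs. Averaging over the number of bigrams keeps scores comparable between texts of different lengths.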

Rob Neuhaus

Thanks for your comment. I did not know about Markov chains, but I had the same idea of calculating a score from bigrams. The problem with my own formula was that the gap between the scores of real text and garbage was very small. I will look at your code to learn more about Markov chains. – Daniel Marschall Oct 18 '11 at 00:09

My colleague translated rrenaud's project to Java, in case this is useful to anyone: https://github.com/paypal/Gibberish-Detector-Java – Eyal Aug 07 '16 at 08:08
