I would like to decipher texts based on frequency analysis. Programming is not the problem, but there are some mathematical difficulties.
(No worries, not for hacking, I want to have a go at the Zodiac 340 cipher, but the question is just in general about deciphering http://zodiackillerciphers.com/wiki/images/7/7d/340-cipher-hi-resolution.jpg, not about the other problems with the cipher.)
I've broken it down to 5 brief questions all related to the cost function to show my effort, short answers are fine, any help appreciated. My problem is that the differences of the values in the cost function is very small.
Given
- A text with any numbers of symbols, called the cipher from now on. The cipher is in English. Each symbol in the cipher stands only for one letter but one letter might be expressed through several symbols. We don't know if there are any spaces (but the String that has to be evaluated by the cost function will be space-delimited and has only letters A-Z).
- Letter frequency analysis (A-Z and space) for: single letter, letter pairs and letter triplets. The 4000 most common words in English or "all" words using sowpods scrabble dictionary.
Questions about frequency analysis:
- Is it better just to check for the most common words or for all words using sowpods (maybe removing 2 and 3 letter words that are not in the 4000 most common words)?
- For the letter pairs and triplets: Is it better to store just their frequency over all, or store it in the form P(A|B) (probability an A is following a B) and P(C|AB) for triplets?
Concept
Skip if not interested. I don't want to go into much details here, there are several methods that could be used. A rough sketch:
- Generate a (semi-)random solution
- Local optimization of the solution based on a cost functon
- Start over and transfer some of the knowledge acquired
- After stagnation for some time try the same thing with the introduction of spaces at fixed positions before the local optimization (in case the message has no spaces)
- compare the 2 retrieved solutions and return the better one
Cost function
How would a cost function look like? The general one could be expressed as:
w1 * letterCost + w2 * pairCost + w3 * tripletCost + w4 * wordCost
and the sum of all wheights is one:
w1 + w2 + w3 + w4 = 1
Questions about the cost function
Now with the simple frequencies ignoring words (w4 = 0) you could just count the frequencies and take the squared difference (this is what I'm currently doing). What I wonder here is: Is it more reasonable to have w1 = w2 = w3 or w1 = 27 * w2 = 27 * 27 * w3 ?
How would it work with the conditional probabilities?
How do you incorporate the knowledge about the words? Just count how many real English words there are, probably weighting them by their length or is there a more intelligent way?