Which algorithms most reliably solve substitution ciphers?

Question

I am studying a problem which reduces to the cryptanalysis of a lengthy monoalphabetic substitution ciphertext written in a known language. This problem is easy enough to be solved by hand using frequency analysis and word patterns, as described in Sinkov's Elementary Cryptanalysis. I'm having trouble finding a theoretically validated algorithm for it: Joux's Algorithmic Cryptanalysis doesn't even cover such rudimentary substitution, and I got nothing out of Gaines's Cryptanalysis: A Study of Ciphers and Their Solution (what other resources should I see?).

Some approaches are pretty obvious. Deciding on each substitution in sequence, then leveraging what is already known, works only if no mistake is made along the way. Employing metaheuristic optimization -- for example, reassign letters until the number of valid words found is maximized -- makes it hard to tell when the search is over. Perhaps a dynamic programming approach to testing the variations would be best. Or, the answers to this question contain other possibly naïve methods.

What are the preferred algorithms for solving this kind of problem?

score 2 · Answer 1 · answered Nov 17 '19 at 21:57

An exact maximum likelihood estimate for the independent letter model

If we assume (unrealistically) that each letter in a given language appears with some known probability independently of nearby letters, then the exact maximum likelihood estimate can be found fairly efficiently. This is about the best you could hope for with this model.

Let's suppose there are n letters in the message and k letters in the alphabet.

The basic idea is that we somehow come up with a numeric score for each of the k^2 possible letter pairs, and then look for a maximum-weight bipartite perfect matching on the two alphabets -- that is, we choose pairs of letters in such a way that every letter in one alphabet is paired with exactly one letter in the other, and the sum of the scores of the chosen pairs is maximised. There are 2 main issues: (1) How to define the score of a given pair; (2) How to solve the general problem of finding the highest-sum subset of pairs once we have decided on the individual scores.

Scoring pairs

The score of pairing the encoded letter x with the source letter y should be log(p(y)^freq(x)), where p(y) is the known probability of letter y in the language, and freq(x) is the number of times letter x appeared in the encoded input. Why? Because under this simple model of the source language that ignores the effect of nearby characters, the probability of generating any given n-character source-language string is equal to the product of the probabilities of the letters that appear in it, which you can also compute as the product of p(y)^freq(y) over all the letters y in the source alphabet. So to compute the probability of generating that string given that each x is "really" a y (and nothing else is a y), we change freq(y) to freq(x). If we then take logs, what we have is the sum of log(p(y)^freq(x)) over all letters y in the source alphabet -- and this sum is exactly what the maximum bipartite perfect matching algorithm will maximise.

Finding the optimal perfect matching

There are algorithms that solve the closely related maximum-weight bipartite matching problem in O(V^2E) time, which amounts to O(k^4) time here. This version of the problem looks to maximise the total sum of all chosen pairs while obeying the constraint that no item is matched with two or more other items in the other set, but without the requirement that every item is matched with some other item. Fortunately, we can turn this algorithm into an algorithm that solves the "perfect" variation we want to solve very simply, by adding a large number M to each score: Intuitively, this causes the algorithm to try to add as many edges as possible (since omitting even a single edge has a large cost, making such a solution much worse than even a "stupid" solution that contains a full set of k edges), while still forcing the algorithm to choose the best among all the k-edge solutions (since all of these solutions include the same kM contribution from the added weight).

Thank you! This naive approach would probably work but I don't understand yet why it's the best option. Wouldn't a more context-aware system, e.g. based on digrams or word patterns, be even more accurate on unusual inputs? — Aaron Brick, Nov 19 '19 at 17:39
It certainly would, but you can easily approximate e.g. digrams by just breaking the input into non-overlapping pairs of letters, effectively changing the alphabet to one of size 26*26=676. This discards information from half of the pairs -- the ones that begin at an even-numbered position -- but that won't matter if you have a lot of data. — j_random_hacker, Nov 19 '19 at 21:58
If you're prepared to assume that the input is definitely a sequence of legal words, a different approach would be to explore a tree where the first node has 26 children corresponding to all possible ways of mapping the first code letter to a source letter; each child has 25 children corresponding to the ways of mapping the next code letter that is distinct from the first (usually the second) to a source letter that produces a 2-character string that is a prefix of some sequence of legal words; etc. The tree has depth equal to the number of distinct code letters; you're trying to reach a leaf. — j_random_hacker, Nov 19 '19 at 22:17

Markus Amalthea Magnuson · Answer 2 · 2020-05-12T17:47:26.887

This paper describes an algorithm to solve homophonic substitution ciphers of which alphabet of length 26 (a-z) is a special case. Their reference implementation in C++ is available here.

The basic idea behind the algorithm (hill climb using digram frequencies to avoid having to decipher more than once) comes from this paper and is (by the authors of the first paper) described as the fastest known algorithm to solve monoalphabetic substitution ciphers.

Which algorithms most reliably solve substitution ciphers?

2 Answers2

An exact maximum likelihood estimate for the independent letter model

Scoring pairs

Finding the optimal perfect matching