Solving a substitution cipher with Python

Question

I know similar questions have been asked, but this is kind of a trivial case.

Given a text file encoded with a substitution cipher, I need to decode it using Python. I am not given any examples of correctly deciphered words. The mapping is 1-to-1 and case doesn't make a difference. Also, punctuation isn't changed and spaces are left where they are. I don't need help with the code so much as with a general idea of how this could be done in code. The approaches I have in mind are:

Narrowing down the choices by first solving 1, 2 or 3 character words.
I could use a list of English words of different sizes to compare.
I could use frequency distributions of the letters.

Does anyone have an idea of a general approach I could take to do this?

This is a question about cryptanalysis rather than about programming...that being the case, it's off-topic. — razlebe, Apr 07 '11 at 20:37
No, it's on-topic. A substitution cipher is not that hard to understand, so it's not some abstract crypto-thing. — Blender, Apr 07 '11 at 20:42
It really isn't about cryptanalysis as much as it's about an effective programming implementation. I thought a 1-to-1 substitution cipher was easy enough to understand. — Championcake, Apr 07 '11 at 21:00
It isn't a case of not understanding the topic. I guess for me, the question is about the algorithm rather than implementing it. — razlebe, Apr 07 '11 at 21:03

score 1 · Answer 1 · answered Apr 07 '11 at 20:44

I would first get a list of English words for reference. Next construct a list of possible 2 and 3 letter words. Then just start testing those small words in your cipher. Once you guess at a small word, check the larger words against your word list. If some of the words no longer have possible completions in the list, you're on the wrong track. If a word only has one possible completion, accept it as correct and continue. Eventually, you'll either reach a solution where all words are in your English word list, or you'll reach a point where there is no solution for a word.
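A minimal sketch of this idea, not a finished solver. It assumes the text has already been lowercased and stripped of punctuation, and it uses a tiny hard-coded `WORDS` list as a stand-in for a real English word list. The helper names (`pattern`, `candidates`, `solve`) are made up for illustration. Words are matched by their letter-repetition pattern, the word with the fewest remaining candidates is tried first, and a word with no candidates means the current guesses are wrong, so the search backtracks.

```python
from collections import defaultdict

# Hypothetical stand-in for a real English word list.
WORDS = ["a", "i", "is", "it", "cat", "the", "this", "that"]

def pattern(word):
    """Letter-repetition pattern, e.g. 'that' -> (0, 1, 2, 0)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)

def index_by_pattern(words):
    """Group dictionary words by their pattern for fast lookup."""
    index = defaultdict(list)
    for w in words:
        index[pattern(w)].append(w)
    return index

def candidates(cipher_word, mapping, index):
    """Dictionary words that fit cipher_word under the partial mapping."""
    used = set(mapping.values())
    fits = []
    for word in index.get(pattern(cipher_word), []):
        ok = True
        for c, p in zip(cipher_word, word):
            if c in mapping:
                if mapping[c] != p:      # contradicts an earlier guess
                    ok = False
                    break
            elif p in used:              # plaintext letter already taken
                ok = False
                break
        if ok:
            fits.append(word)
    return fits

def solve(cipher_words, index, mapping=None):
    """Return a cipher->plain letter mapping, or None if no assignment works."""
    mapping = mapping or {}
    if not cipher_words:
        return mapping
    options = {w: candidates(w, mapping, index) for w in sorted(set(cipher_words))}
    word = min(options, key=lambda w: len(options[w]))  # most constrained first
    for guess in options[word]:          # an empty list means a dead end
        trial = dict(mapping)
        trial.update(zip(word, guess))
        rest = [w for w in cipher_words if w != word]
        result = solve(rest, index, trial)
        if result is not None:
            return result
    return None                          # backtrack

def decode(text, mapping):
    """Apply the mapping, leaving unmapped characters untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)

index = index_by_pattern(WORDS)
cipher = "uijt jt b dbu"
mapping = solve(cipher.split(), index)
print(decode(cipher, mapping))
```

As the second answer points out, this kind of search fails outright if the plaintext contains a word that is missing from the list, so a real version would probably need to tolerate a few unmatched words.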

This is exactly what I need to do. Thanks for enunciating it like this. I think that's mostly what I needed. — Championcake, Apr 07 '11 at 21:01

score 1 · Answer 2 · nmichaels · 2011-04-07T21:04:54.383

I wrote something like this for when Haley's speech was all garbled. It wasn't automagic though; it made guesses based on etaoinshrdlu (the most frequently used letters in English, sorted most to least) and let the user interactively change the meaning of a given ciphertext letter.

So it would show you something like:

t0is is a 12eat 34556e!

and you'd manually guess what letter each number represented until you had something legible.
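The original code is not shown, so the following is only a sketch of how such a tool might look. The function names `frequency_guess` and `render` are hypothetical. The first makes an initial guess by ranking cipher letters by frequency against etaoinshrdlu. The second prints the text with known letters decoded and each still-unknown cipher letter replaced by a digit (falling back to `?` once the ten digits run out).

```python
from collections import Counter
import string

ENGLISH_ORDER = "etaoinshrdlu"  # most to least frequent, as given above

def frequency_guess(ciphertext, n=3):
    """Map the n most common cipher letters onto the most common English letters."""
    counts = Counter(ch for ch in ciphertext.lower() if ch in string.ascii_lowercase)
    ranked = [ch for ch, _ in counts.most_common(n)]
    return dict(zip(ranked, ENGLISH_ORDER))

def render(ciphertext, known):
    """Decode known letters; show each unknown cipher letter as a digit."""
    placeholders = {}
    out = []
    for ch in ciphertext.lower():
        if ch not in string.ascii_lowercase:
            out.append(ch)                       # spaces and punctuation pass through
        elif ch in known:
            out.append(known[ch])
        else:
            idx = placeholders.setdefault(ch, len(placeholders))
            out.append(string.digits[idx] if idx < 10 else "?")
    return "".join(out)

# Assumed example: the ciphertext is a shift-by-one encoding, with five letters
# already assigned by the user.
known = {"u": "t", "j": "i", "t": "s", "b": "a", "f": "e"}
print(render("uijt jt b hsfbu qvaamf!", known))
```

An interactive loop would repeatedly show the rendered text, read a "cipher letter = plain letter" assignment from the user, update `known`, and render again. Because `known` is an ordinary dict, earlier guesses can be overwritten at any time.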

The advantage of this approach is that it can tolerate typos. If your encryptor makes any errors (or uses any words not in your dictionary in the plaintext), a fully automatic solver may find itself with an unsolvable puzzle.

That said, spell checkers have great lists of English words. I used the one in Debian's dictionaries-common package for my hangman solver.

edited Apr 07 '11 at 21:04

answered Apr 07 '11 at 20:44


Yeah, I was thinking about an approach like this. The real problem arises when the program makes a mistake and doesn't know to backtrack. – Championcake Apr 07 '11 at 20:58
@Championcake: My slapped-together one let you change letters that had already been assigned (or assign them first) then re-do the frequency analysis guessing bit. I wonder if I still have that code anywhere. It was 3 or 4 hard drives ago... – nmichaels Apr 07 '11 at 21:02

score 1 · Answer 3 · answered Apr 07 '11 at 21:52

You could try this approach:

Store a list of valid words (in a dictionary) and a "normal" letter distribution for your language (in a list).
Calculate the distribution of the letters in the garbled text.
Compare the garbled distribution with the normal one and re-map the letters of your text accordingly.
Repeat: Initialize an array (rank) that maps each of the 26 letters to a float (rank('A') = rank('B') = ... = rank('Z') = 0.0).
Check the words in the produced text against the words in the dictionary. If a word is in the dictionary, raise the rank of that word's letters (for example, add a fixed value such as 1.0). In other words, calculate a score (a function of the total rank and the number of words found in the dictionary).
Save the text into a high score table (if the score is high enough).
If all words are in the dictionary, or the total rank is high enough, or the loop has run more than 10000 times, go to End.
If not, choose two letters at random and swap them, using a skewed distribution: letters with a high rank should have less chance of being swapped.
Repeat.
End: Print the high score texts.

The procedure resembles simulated annealing.
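A stripped-down sketch of the swap loop, under several assumptions. It is plain hill climbing rather than real annealing (a swap is kept only if the score does not drop), it picks the two letters uniformly instead of weighting by rank, and it tracks only the single best key rather than a high score table. The score is simply the fraction of words found in the word set. The names `decode`, `score` and `search` are illustrative, and the word set is a toy stand-in for a real dictionary.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def decode(text, key):
    """key[i] is the plaintext letter for the cipher letter ALPHABET[i]."""
    return text.lower().translate(str.maketrans(ALPHABET, key))

def score(text, words):
    """Fraction of whitespace-separated tokens that are dictionary words."""
    tokens = ["".join(c for c in t if c.isalpha()) for t in text.split()]
    tokens = [t for t in tokens if t]
    return sum(t in words for t in tokens) / len(tokens) if tokens else 0.0

def search(ciphertext, words, start_key=ALPHABET, max_iter=10000, seed=0):
    """Randomly swap two key letters; keep the swap unless the score drops."""
    rng = random.Random(seed)
    key = list(start_key)
    best = score(decode(ciphertext, "".join(key)), words)
    for _ in range(max_iter):
        if best == 1.0:                        # every word is in the dictionary
            break
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]
        trial = score(decode(ciphertext, "".join(key)), words)
        if trial >= best:
            best = trial                       # keep the swap
        else:
            key[i], key[j] = key[j], key[i]    # undo the swap
    return "".join(key), best

words = {"this", "is", "a", "cat"}             # toy word set
key, best = search("uijt jt b dbu", words, max_iter=2000)
print(best, decode("uijt jt b dbu", key))
```

In practice `start_key` would come from the frequency comparison in the earlier steps rather than the identity key used here. A whole-word score is also very coarse, since most swaps leave it unchanged, so this version may stall on short texts. That is where the rank weighting and the occasional acceptance of a worse key (as in true annealing) would be expected to help.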
