Translating Morse code with no spaces

Question

I have some Morse code that has lost the spaces between the letters; my challenge is to find out what the message says. So far I have been kinda lost because of the sheer number of combinations there might be.

Here is all the info on the messages I have.

The output will be English
There will always be a translation that makes sense
Here is an example message: -..-...-...-...-..-.-.-.-.-..-.-.-.-.-.-.-.-.-.-..-...-.
The messages should be no longer than 70 characters
The Morse code was taken from a longer stream, so it is possible that the first or last groups may be cut off and hence have no valid translation

Does anyone have a clever solution?

There isn't a clever solution. Work out the combinations and check them against a dictionary. If there's more than one correct decoding of the entire message you have no way, without further information, of knowing which is correct. — Endophage, Dec 01 '11 at 22:24
I'm almost certain you could find *a* valid interpretation using a massive regexp containing all valid Morse letters, such that it would backtrack when it hit an invalid sequence until it built up a valid one; however, I can't offhand think of an elegant way to get all possible translations, and I'm fairly certain that there will be many ambiguous potential translations that are valid, but nonsensical. — BRPocock, Dec 01 '11 at 22:32
I was thinking maybe running them through some code to check letter frequency to help determine which might be correct. — giodamelio, Dec 01 '11 at 22:33
The part I'm stuck on is the "massive regexp containing all valid Morse letters". I don't have a clue on how to make that. — giodamelio, Dec 01 '11 at 22:34
How long is the message? Part of the problem is that for any sequence, a strictly valid, but entirely unhelpful decoding is to replace single dots with E and single dashes with T, but you'd end up with ETETTTEETETEEETE... What other info do you have, and are there any spaces at all? — FredL, Dec 01 '11 at 22:41
You might want to think about the way predictive text algorithms for mobile phones work..? Given a partial decoding, you could work out (from an analysis of lots of sample text) what letters were most likely to come next and direct any search based on this probability. You could also look at the frequency with which different words follow the one you've decoded. However you do it, it's going to involve a big search, but there are probably lots of ways of improving your odds given the not-entirely-random nature of the English language :) — FredL, Dec 03 '11 at 19:08
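
For the "massive regexp" asked about in the comments, here is a minimal sketch in Python 3, restricted to the letters A-Z. The names `STREAM_RE`, `is_decodable` and `letter_decodings` are made up for this illustration. Since `.` and `-` are letters themselves, the regexp accepts every dot/dash stream, so it mostly confirms that the input is well-formed; the small generator is what actually enumerates the letter-level splits, and it is only practical for short inputs because the number of splits grows exponentially.

```python
import re
import string

CODES = ('.- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- '
         '-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..').split()
MORSE = dict(zip(string.ascii_uppercase, CODES))
DECODE = {code: letter for letter, code in MORSE.items()}

# One alternative per letter, longest codes first.
ALTERNATIVES = '|'.join(re.escape(c)
                        for c in sorted(CODES, key=len, reverse=True))
STREAM_RE = re.compile('(?:%s)+' % ALTERNATIVES)


def is_decodable(stream):
    """True if the whole stream can be split into Morse letters."""
    return STREAM_RE.fullmatch(stream) is not None


def letter_decodings(stream):
    """Yield every way of reading the stream as a run of letters."""
    if not stream:
        yield ''
        return
    for n in range(1, 5):  # letter codes are 1 to 4 symbols long
        head = stream[:n]
        if len(head) == n and head in DECODE:
            for rest in letter_decodings(stream[n:]):
                yield DECODE[head] + rest


print(is_decodable('-..-'))
print(sorted(letter_decodings('...')))
```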

score 8 · Accepted Answer · answered Dec 04 '11 at 13:07

This is not an easy problem, because as ruakh suggested there are many viable sentences for a given message. For example 'JACK AND JILL WENT UP THE HILL' has the same encoding as 'JACK AND JILL WALK CHISELED'. Since these are both grammatical sentences and the words in each are common, it's not obvious how to pick one or the other (or any other of the 40141055989476564163599 different sequences of English words that have the same encoding as this message) without delving into natural language processing.
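
The collision between those two sentences is easy to check by hand or with a few lines of Python 3. This is only a sketch; the `encode` helper here is a standalone stand-in written for the check.

```python
import string

CODES = ('.- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- '
         '-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..').split()
MORSE = dict(zip(string.ascii_uppercase, CODES))


def encode(msg):
    """Morse for an all-caps message with every separator dropped."""
    return ''.join(MORSE[c] for c in msg if c != ' ')


a = encode('JACK AND JILL WENT UP THE HILL')
b = encode('JACK AND JILL WALK CHISELED')
print(a == b)
```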

Anyway, here's a dynamic programming solution to the problem of finding the shortest sentence (with the fewest characters if there's a tie). It can also count the total number of sentences that have the same encoding as the given message. It needs a dictionary of English words in a file.

The next enhancement should be a better measure of how likely a sentence is: perhaps word frequencies, or false-positive rates in Morse (e.g. "I" is a common word, but its code also turns up often inside the codes of other words). The tricky part will be formulating a good score function that can still be computed using dynamic programming.

MORSE = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ', [
    '.-', '-...', '-.-.', '-..', '.', '..-.', '--.', '....',
    '..', '.---', '-.-', '.-..', '--', '-.', '---', '.--.',
    '--.-', '.-.', '...', '-', '..-', '...-', '.--', '-..-',
    '-.--', '--..'
]))

# Read a file containing A-Z only English words, one per line.
WORDS = set(word.strip().upper() for word in open('dict.en').readlines())
# A set of all possible prefixes of English words.
PREFIXES = set(word[:j+1] for word in WORDS for j in xrange(len(word)))

def translate(msg, c_sep=' ', w_sep=' / '):
    """Turn a message (all-caps space-separated words) into morse code."""
    return w_sep.join(c_sep.join(MORSE[c] for c in word)
                      for word in msg.split(' '))

def encode(msg):
    """Turn a message into timing-less morse code."""
    return translate(msg, '', '')

def c_trans(morse):
    """Construct a map of char transitions.

    The return value is a list indexed by position in the morse code stream.
    Each entry is a dict mapping every character that can start at that
    position to the index just past it. Dead-end transitions are omitted.
    """
    result = [{} for i in xrange(len(morse))]
    for i_ in xrange(len(morse)):
        i = len(morse) - i_ - 1
        for c, m in MORSE.iteritems():
            if i + len(m) < len(morse) and not result[i + len(m)]:
                continue
            if morse[i:i+len(m)] != m: continue
            result[i][c] = i + len(m)
    return result

def find_words(ctr, i, prefix=''):
    """Find all legal words starting from position i.

    We generate all possible words starting from position i in the
    morse code stream, assuming we already have the given prefix.
    ctr is a char transition dict, as produced by c_trans.
    """
    if prefix in WORDS:
        yield prefix, i
    if i == len(ctr): return
    for c, j in ctr[i].iteritems():
        if prefix + c in PREFIXES:
            for w, j2 in find_words(ctr, j, prefix + c):
                yield w, j2

def w_trans(ctr):
    """Like c_trans, but produce a word transition map."""
    result = [{} for i in xrange(len(ctr))]
    for i_ in xrange(len(ctr)):
        i = len(ctr) - i_ - 1
        for w, j in find_words(ctr, i):
            if j < len(result) and not result[j]:
                continue
            result[i][w] = j
    return result

def shortest_sentence(wt):
    """Given a word transition map, find the shortest possible sentence.

    We find the sentence that uses the entire morse code stream, and has
    the fewest number of words. If there are multiple sentences that
    satisfy this, we return the one that uses the smallest number of
    characters.
    """
    result = [-1 for _ in xrange(len(wt))] + [0]
    words = [None] * len(wt)
    for i_ in xrange(len(wt)):
        i = len(wt) - i_ - 1
        for w, j in wt[i].iteritems():
            if result[j] == -1: continue
            if result[i] == -1 or result[j] + 1 + len(w) / 30.0 < result[i]:
                result[i] = result[j] + 1 + len(w) / 30.0
                words[i] = w
    i = 0
    result = []
    while i < len(wt):
        result.append(words[i])
        i = wt[i][words[i]]
    return result

def sentence_count(wt):
    result = [0] * len(wt) + [1]
    for i_ in xrange(len(wt)):
        i = len(wt) - i_ - 1
        for j in wt[i].itervalues():
            result[i] += result[j]
    return result[0]

msg = 'JACK AND JILL WENT UP THE HILL'
print sentence_count(w_trans(c_trans(encode(msg))))
print shortest_sentence(w_trans(c_trans(encode(msg))))
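
One way the word-frequency idea mentioned above could fit into dynamic programming is to give each word a cost of -log(relative frequency) and minimise the total. The sketch below is Python 3 and independent of the code above; `best_sentence` is a made-up name and the `COUNTS` table is invented purely for the example, so a real run would need counts from an actual corpus.

```python
import math
import string

CODES = ('.- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- '
         '-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..').split()
MORSE = dict(zip(string.ascii_uppercase, CODES))


def encode(msg):
    """Morse for an all-caps message with every separator dropped."""
    return ''.join(MORSE[c] for c in msg if c != ' ')


# Hypothetical counts, invented for this example only.
COUNTS = {'THE': 5000, 'UP': 1000, 'WENT': 300, 'WALK': 300,
          'HILL': 200, 'CHISELED': 1}


def best_sentence(stream, counts):
    """Return the word list with the lowest total cost, or None."""
    total = float(sum(counts.values()))
    coded = [(w, encode(w), -math.log(c / total)) for w, c in counts.items()]
    n = len(stream)
    best = [math.inf] * n + [0.0]  # best[i]: cheapest decoding of stream[i:]
    choice = [None] * n            # choice[i]: (word, next index) for best[i]
    for i in range(n - 1, -1, -1):
        for word, code, cost in coded:
            j = i + len(code)
            if stream.startswith(code, i) and cost + best[j] < best[i]:
                best[i] = cost + best[j]
                choice[i] = (word, j)
    if math.isinf(best[0]):
        return None
    words, i = [], 0
    while i < n:
        word, i = choice[i]
        words.append(word)
    return words


print(best_sentence(encode('WALK CHISELED'), COUNTS))
```

Because the costs simply add up along a sentence, the same right-to-left pass used for the shortest sentence still applies. Scores that depend on neighbouring words (bigrams, grammar) would need a larger state than just the stream position.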

For users of UNIX-like systems, `/usr/share/dict/words` works quite well as a dictionary (i.e. it can be used in place of `'dict.en'`). — David Cain, Mar 21 '14 at 18:33

score 0 · Answer 2 · answered Dec 01 '11 at 23:00

I don't know if this is "clever", but I would try a breadth-first search (as opposed to the depth-first search implicit in BRPocock's regex idea). Suppose your string looks like this:

.---.--.-.-.-.--.-...---...-...-..
J   A C   K  A N D  J   I L   L

You start out in state ('', 0) ('' being what you've decoded so far; 0 being your position in the Morse-code string). Starting from position zero, possible initial characters are . E, .- A, .-- W, and .--- J (.---- 1 would be a fifth candidate, but here the fifth symbol is a dot). So, push states ('E', 1), ('A', 2), ('W', 3), and ('J', 4) onto your queue. After dequeuing state ('E', 1), you would enqueue states ('ET', 2), ('EM', 3), and ('EO', 4).

Now, your queue of possible states will grow quite quickly — both of { ., - } are letters, as are all of { .., .-, -., -- } and all of { ..., ..-, .-., .--, -.., -.-, --., --- }, so in each pass your number of states will increase by a factor of at least three — so you need to have some mechanism for user feedback. In particular, you need some way to ask your user "Is it plausible that this string starts with EOS3AIOSF?", and if the user says "no", you will need to discard state ("EOS3AIOSF", 26) from your queue. The ideal would be to present the user with a GUI that, every so often, shows all current states and lets him/her select which ones are worth proceeding with. ("The user" will also be you, of course. English has a shortage of pronouns: if "you" refers to the program, then what pronoun refers to the user-programmer?!)
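
A minimal Python 3 sketch of that breadth-first search, handling letters only (digits are left out to keep the table short). The `keep` callback is a hypothetical stand-in for the user feedback described above; the names `successors` and `bfs` are likewise made up for the example.

```python
import string
from collections import deque

CODES = ('.- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- '
         '-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..').split()
DECODE = dict(zip(CODES, string.ascii_uppercase))

STREAM = '.---.--.-.-.-.--.-...---...-...-..'


def successors(stream, state):
    """Yield the states reachable by decoding one more letter."""
    text, pos = state
    for n in range(1, 5):  # letter codes are 1 to 4 symbols long
        chunk = stream[pos:pos + n]
        if len(chunk) == n and chunk in DECODE:
            yield (text + DECODE[chunk], pos + n)


def bfs(stream, keep=lambda text: True):
    """Yield complete decodings, dropping states that `keep` rejects."""
    queue = deque([('', 0)])
    while queue:
        text, pos = queue.popleft()
        if pos == len(stream):
            yield text
            continue
        for state in successors(stream, (text, pos)):
            if keep(state[0]):
                queue.append(state)


print(list(successors(STREAM, ('', 0))))
print(list(successors(STREAM, ('E', 1))))
```

Without a restrictive `keep`, running `bfs` on the full 34-symbol stream is impractical because the queue grows exponentially, which is the reason pruning is needed in the first place.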

You can omit half the pronouns you used. Replace "your" with "the" etc. Talk about the program in third person and yourself in first. — N_A, Dec 01 '11 at 23:10
@mydogisbox: Thanks. For the record, my complaint was tongue-in-cheek -- I am aware that a program can be referred to as "it" -- but I do appreciate the effort. :-) — ruakh, Dec 01 '11 at 23:23
Breadth first only beats depth first if there can be short branches in the search tree. Since every branch in this tree has exactly the same depth (the length of the input), this just has the effect of using worst-case exponentially more memory. — , Dec 01 '11 at 23:41
@PaulHankin: As I explained in my last paragraph, I'm relying on the user to trim/prune the branches. (You're doing the same thing, actually, but using a word-list instead. Probably it would be best to meld the two approaches: your logic would help the program make guesses about what branches are more probable and propose likely word-breaks, while using a priority-queue as I suggested in my comment to your answer would allow it to suggest only the most-likely options to the user for branch-pruning.) — ruakh, Dec 01 '11 at 23:51

score 0 · Answer 3 · answered Dec 01 '11 at 23:16

Maintain 3 things: a list of words so far S, the current word so far W, and the current symbol C.

S should be only good words, eg. 'THE QUICK'
W should be a valid prefix of a word, eg. 'BRO'
C should be a valid prefix of some letter, eg. '.-'

Now, given a new symbol, let's say '-', we extend C with it (in this case we get '.--'). If C is a complete letter (in this case it is, the letter 'W'), we have a choice to add it to W, or to continue extending the letter further by adding more symbols. If we extend W, we have a choice to add it to S (if it's a valid word), or to continue extending it further.

This is a search, but most paths terminate quickly (as soon as you have W not being a valid prefix of any word you can stop, and as soon as C isn't a prefix of any letter you can stop).

To make it more efficient, you could use dynamic programming to avoid redundant work and use tries to efficiently test prefixes.

What might the code look like? Omitting the functions 'is_word' which tests if a string is an English word, and 'is_word_prefix' which tests if a string is the start of any valid word, something like this:

morse = {
    '.-': 'A',
    '-...': 'B',
    # ... and so on for the remaining letters
}

def is_morse_prefix(C):
    return any(k.startswith(C) for k in morse)

def break_words(input, S, W, C):
    while True:
        if not input:
            if W == C == '':
                yield S
            return
        i, input = input[0], input[1:]
        C += i
        if not is_morse_prefix(C):
            return
        ch = morse.get(C, None)
        if ch is None or not is_word_prefix(W + ch):
            continue
        for result in break_words(input, S, W + ch, ''):
            yield result
        if is_word(W + ch):
            for result in break_words(input, S + [W + ch], '', ''):
                yield result

for S in break_words('....--', [], '', ''):
    print S
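
The two omitted helpers could be backed by a trie, as suggested above. The following is only a sketch: `WORDS` is a tiny hypothetical sample standing in for a real word list, and `build_trie` and `_walk` are names invented for the example.

```python
END = '$'  # marker key: a complete word ends at this node

# Tiny hypothetical sample; a real run would load a full word list.
WORDS = ['HE', 'HEM', 'ME', 'I', 'IT', 'SIT']


def build_trie(words):
    """Build a nested-dict trie with one level per letter."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


TRIE = build_trie(WORDS)


def _walk(s):
    """Return the trie node reached by following s, or None."""
    node = TRIE
    for ch in s:
        node = node.get(ch)
        if node is None:
            return None
    return node


def is_word_prefix(s):
    """True if at least one word starts with s."""
    return _walk(s) is not None


def is_word(s):
    """True if s itself is in the word list."""
    node = _walk(s)
    return node is not None and END in node
```

Both lookups take time proportional to the length of the string being tested, independent of the size of the word list.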

You added the condition that the start and end can be garbage after I posted. You can adapt the code to that by ignoring garbage at the start by trying input, input[1:], input[2:] until you find good words. For garbage at the end, you can ignore it by dropping the test that W==C=='' at the start, and just always yielding S. – Dec 01 '11 at 23:21
The tricky part about using `is_word` and `is_word_prefix` is that a message in Morse code can contain non-words, such as proper names. I would imagine that such functions would be more useful for estimating the probability of a given W than for rejecting a given W out-of-hand. This means that it wouldn't help your paths "terminate quickly" (though it might allow the possibility of using some sort of priority queue to focus on the paths that seem most likely to be fruitful). – ruakh Dec 01 '11 at 23:30
It's a variation of the code to produce the most probable output (it needs dynamic programming though to be even vaguely efficient). However, the condition given was that it was English, which I think it's reasonable to interpret to mean that the message contains only dictionary words, especially as this is an academic and not a practical program. – Dec 01 '11 at 23:33
Well, but if "most probable" is evaluated solely on the basis of "is W a word?", then you'll get lots of spurious I's (`..`) and A's (`.-`). I would guess that a large proportion of twenty-symbol Morse-code strings can be decoded as consisting solely of valid one- and two-letter words, even if interpretations with longer words would make more sense to a human. – ruakh Dec 01 '11 at 23:44
Can you give an example ruakh? It seems to me that most times you extract 'A' or 'I' you'll leave nonsense on either side. – Dec 01 '11 at 23:53
I think you're underestimating just how much entropy Morse code has. Just the six words `.. I`, `.- A`, `-.. TI`, `-.- TA`, `--. ME`, and `--- O` cover all possible prefixes. By that approach, the example phrase in my answer, `JACK AND JILL`, is also `A ME ME TA A A TA I A ME I TI A I`. If you balk at `TA` and `TI` (both valid in Scrabble, but unlikely in telegrams), how about `A ME MET A A AT A I A ME IT I A I`? (Same letter-sequence, just changing `ME TA` to `MET A`, `A TA` to `AT A`, and `I TI` to `IT I`.) – ruakh Dec 02 '11 at 00:23
