Morse without separators - best algorithm

Question

In Morse code there are dots and dashes in groups of 1-4 separated by a separator. Each group means one letter. Between words there are two separators. Between sentences three.

An application for decoding basic Morse code is quite easy to write. But my question is how to solve the problem when there are no separators. I know that there will be a huge number of nonsense results, but that's not my point. I only need to get all possible results in the most efficient way.

This would be an input:

......-...-..---

And this would be one of many outputs:

hello

How would you do that?

Are some original texts more likely than others? (E.g. do you know that the original text is mostly English text?) If you know character probabilities (or even better, probabilities of pairs or triples of characters etc.) then using a Hidden Markov Model will give a *much* more informative output. You can e.g. determine the most probable overall decoding. — j_random_hacker, Jan 27 '16 at 04:19

score 3 · Accepted Answer · edited May 23 '17 at 12:22

After reading a dit or dah, you have two options: terminate the letter or continue the current letter. This will lead to a lot of bifurcations in your code and a recursive approach might be a good way to implement this.

Keep a buffer of the possible string so far and print (or store) the result when you hit the string end and it coincides with the end of a letter.

Here's an implementation in C:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

static const char *letter = "**ETIANMSURWDKGOHVF*L*PJBXCYZQ**";

void morse_r(char *buf, int len, const char *str, int code)
{
    if (*str == '\0') {
        // end of string; print if starting a new letter
        if (code == 1) printf("%.*s\n", len, buf);
    } else if (code < 16) {
        // descend in the implicit tree: dit is the left child, dah the right
        if (*str == '.') code = 2 * code;
        if (*str == '-') code = 2 * code + 1;

        // continue letter
        morse_r(buf, len, str + 1, code);

        // start new letter
        buf[len++] = letter[code];
        morse_r(buf, len, str + 1, 1);
    }
}

void morse(const char *str)
{
    char buf[strlen(str) + 1];      // +1 avoids a zero-length array for empty input

    morse_r(buf, 0, str, 1);
}

int main()
{
    morse("......-...-..---");

    return 0;
}

This implementation is very simple. It uses a simplistic lookup mechanism and it doesn't check whether a letter actually exists. (The asterisks in the letter array are valid Morse codes, but they are not Latin letters.)

This approach is also rather brute force: It recalculates the tails over and over. Memoization of the tails will save the processor a lot of extra work for longer strings.

And, as you are aware, there will be a lot of nonsense results. The above code yields 20569 strings (some of them with asterisks, i.e. invalid). You can prevent many recursions when you do a plausibility or dictionary check on your way. For example, many dots in a row will yield a lot of nonsense words with repeated Es.
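
To illustrate the memoization remark, here is a minimal sketch that counts the decodings instead of printing them. It assumes the same letter table as the code above; count_decodings and its strict flag are names made up for this example, not part of the original answer. Every prefix is evaluated only once, so counting takes time linear in the input length, even though listing all strings stays exponential.

```c
#include <stdlib.h>
#include <string.h>

// Same flat lookup table as above; '*' marks codes that are not letters.
static const char *letter = "**ETIANMSURWDKGOHVF*L*PJBXCYZQ**";

// Hypothetical helper: number of ways to split str into Morse letters.
// ways[i] holds the number of decodings of the first i symbols.
// If strict is non-zero, codes that map to '*' are not counted.
unsigned long long count_decodings(const char *str, int strict)
{
    size_t n = strlen(str);
    unsigned long long *ways = calloc(n + 1, sizeof *ways);
    unsigned long long result;

    if (ways == NULL) return 0;
    ways[0] = 1;                            // empty prefix: exactly one way

    for (size_t i = 0; i < n; i++) {
        int code = 1;                       // root of the implicit tree

        if (ways[i] == 0) continue;         // prefix cannot be decoded
        // try a letter of 1..4 symbols starting at position i
        for (size_t k = 1; k <= 4 && i + k <= n; k++) {
            char c = str[i + k - 1];

            if (c != '.' && c != '-') break;    // not a Morse symbol
            code = 2 * code + (c == '-');
            if (strict && letter[code] == '*') continue;
            ways[i + k] += ways[i];
        }
    }

    result = ways[n];
    free(ways);
    return result;
}
```

With strict set to 0 every code of one to four symbols counts, as in the recursive version, and count_decodings("......-...-..---", 0) returns 20569. With strict set to 1 the codes that map to an asterisk are left out.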

Edit: As Jim Mischel points out, an explanation of how the Morse code lookup works is in order. Yves Daoust mentions a trie in his answer. A trie is a tree structure for storing words; each node can have as many children as there are letters in the alphabet. Morse code has only two letters: dit (.) and dah (-). A Morse code trie is therefore a binary tree.

Tries are usually sparse; words are rather long and many letter combinations don't exist. Morse tries are dense: Morse letter encodings are short and nearly every combination is used. The tree can be stored as a linear, "flat" array, similar to a heap. A node is represented by its index i in the array. The left child is then 2*i and the right child 2*i + 1.
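
As a small sketch of that indexing, the function below walks the flat array for a single, already separated Morse letter. The name morse_letter is an assumption made for this example; the table is the one from the code above. The walk starts at index 1 (the root), goes to 2*i for a dit and to 2*i + 1 for a dah, and reads the letter at the final index.

```c
#include <stddef.h>

// Flat binary tree: index 1 is the root, 2*i the dit child, 2*i + 1 the dah child.
static const char *letter = "**ETIANMSURWDKGOHVF*L*PJBXCYZQ**";

// Hypothetical helper: decode one Morse letter such as ".-".
// Returns '*' for a code that is not a letter (including the empty code)
// and '\0' for input that is too long or contains other characters.
char morse_letter(const char *code)
{
    int i = 1;                              // start at the root

    for (; *code != '\0'; code++) {
        if (i >= 16) return '\0';           // already four symbols deep
        if (*code == '.') i = 2 * i;        // dit: left child
        else if (*code == '-') i = 2 * i + 1;   // dah: right child
        else return '\0';
    }
    return letter[i];
}
```

For example, "...." visits the indices 1, 2, 4, 8 and 16, and letter[16] is 'H'.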

A better and more detailed explanation can be found in an answer I posted to another Morse-related question, from where I've taken the lookup code that I used in the example above.

An interesting algorithm. Good answer except that you should explain that your `letter` array is a binary tree stored in breadth-first order, and how your algorithm makes use of it. — Jim Mischel, Jan 25 '16 at 17:08
@JimMischel: Thanks. You are right, I've slapped the code here without really explaining it. I had the lookup routine still sitting somewhere from an older answer, so I just used it. I've added an explanation. — M Oehm, Jan 25 '16 at 19:28
This implicit heap representation is indeed quite appropriate here. — , Jan 25 '16 at 19:52

score 2 · Answer 2 · 2016-01-25T09:49:11.790

IMO the most efficient approach will be with a trie. This is a tree such that every node has up to two sons, one for . and one for -, when these characters are possible at the given stage. In addition to the links to the sons, a node has a "terminal" character telling what character the path from the root to this node encodes; the terminal character can be a zero to indicate that the path does not encode any character (the string isn't finished).

As the Morse alphabet is tiny, you can even build the trie by hand. Here is a part of it.

. => E
    . => I
        . => S
        - => U
    - => A
        . => R
        - => W
- => T
    . => N
        . => D
        - => K
    - => M
        . => G
        - => O

To exploit the trie, write a recursive function that takes as input a position in the input stream and a node in the trie. If the node has a terminal character, append the terminal character to the output string and reset the node to the root of the trie. At the same time, continue the exploration of the trie by following the son that matches the next input symbol.

Here are the first few steps (analysis of the first four input symbols) of the recursive execution in your example case:

. => E
    .|. => EE
        .|.|. => EEE
            .|.|.|. => EEEE
            .|.|.. => EEI
        .|.. => EI
            .|..|. => EIE
            .|... => ES
    .. => I
        ..|. => IE
            ..|.|. => IEE
            ..|.. => II
        ... => S
            ...|. => SE
            .... => H
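
The recursion described above could be sketched in C roughly as follows. This is only one way to do it, under a few assumptions: the trie is built from a table of the 26 letter codes instead of by hand, the nodes come from a small static pool, and the names node, build_trie, explore and decode_with_trie are made up for this example. A missing child prunes the search, so codes that are not letters never show up in the output.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// One trie node: a child per symbol, plus the letter that the path
// from the root encodes (0 if the path is not a complete letter).
struct node {
    struct node *dot, *dash;
    char terminal;
};

// Morse codes for A..Z in alphabetical order.
static const char *codes[26] = {
    ".-", "-...", "-.-.", "-..", ".", "..-.", "--.", "....", "..",
    ".---", "-.-", ".-..", "--", "-.", "---", ".--.", "--.-", ".-.",
    "...", "-", "..-", "...-", ".--", "-..-", "-.--", "--.."
};

static struct node pool[32];    // the 26 letters need 27 nodes including the root
static int used;

static struct node *build_trie(void)
{
    memset(pool, 0, sizeof pool);
    used = 1;                               // pool[0] is the root
    for (int i = 0; i < 26; i++) {
        struct node *n = &pool[0];
        for (const char *c = codes[i]; *c != '\0'; c++) {
            struct node **child = (*c == '.') ? &n->dot : &n->dash;
            if (*child == NULL) *child = &pool[used++];
            n = *child;
        }
        n->terminal = (char)('A' + i);
    }
    return &pool[0];
}

// Returns the number of decodings found; prints them if print is non-zero.
static long explore(const struct node *root, const struct node *n,
                    const char *str, char *buf, int len, int print)
{
    const struct node *next;
    long count = 0;

    if (*str == '\0') {
        if (n != root) return 0;            // input ends in the middle of a letter
        if (print) printf("%.*s\n", len, buf);
        return 1;
    }
    next = (*str == '.') ? n->dot : (*str == '-') ? n->dash : NULL;
    if (next == NULL) return 0;             // no letter has this prefix: prune

    // keep extending the current letter
    count += explore(root, next, str + 1, buf, len, print);
    // or, if the path encodes a letter, emit it and restart at the root
    if (next->terminal != 0) {
        buf[len] = next->terminal;
        count += explore(root, root, str + 1, buf, len + 1, print);
    }
    return count;
}

long decode_with_trie(const char *str, int print)
{
    char *buf = malloc(strlen(str) + 1);    // at most one letter per symbol
    struct node *root = build_trie();
    long count;

    if (buf == NULL) return 0;
    count = explore(root, root, str, buf, 0, print);
    free(buf);
    return count;
}
```

Calling decode_with_trie("....", 1) prints the eight strings that appear as leaves in the trace above (EEEE, EEI, EIE, ES, IEE, II, SE, H, though not necessarily in that order) and returns 8.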

score 0 · Answer 3 · answered Jan 25 '16 at 11:55

You can do it in two passes. The first pass marks the positions where a letter can end, and the second extracts all possible strings.

The first pass can be implemented with dynamic programming. possible[x] is true if the first x symbols can be parsed into a sequence of letters. Start with possible[0] = true, then compute possible[x] for every other x: take the last 1, 2, 3 and 4 symbols before position x, and if they match an existing Morse code and the value of possible for the rest of the prefix is true, then mark possible[x] true as well. This is O(N).

Now you have to extract all the possible words. Start from the end and use the possible vector to eliminate wrong solutions. Again try the last 1-4 symbols: if they match a Morse code and the corresponding possible position is true, take the letter as a candidate and call the function recursively to generate all the solutions for what remains. Unfortunately this is exponential in the worst case, when nearly every split is valid; O(4^N) is a loose upper bound. In practice the work grows with the number of decodings that match the input string, so if there are only a few options, this pass will be fast as well.
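
A minimal sketch of the two passes, under some assumptions: it reuses the flat lookup table from the accepted answer to test whether 1-4 symbols form a letter, and lookup, mark_possible, extract and decode_two_pass are names invented for this example. The output buffer is filled from the back because the second pass walks the input from its end.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Flat lookup table borrowed from the accepted answer; '*' is not a letter.
static const char *letter = "**ETIANMSURWDKGOHVF*L*PJBXCYZQ**";

// Letter encoded by the k (1..4) symbols at s, or 0 if there is none.
static char lookup(const char *s, size_t k)
{
    int code = 1;

    for (size_t j = 0; j < k; j++) {
        if (s[j] == '.') code = 2 * code;
        else if (s[j] == '-') code = 2 * code + 1;
        else return 0;
    }
    return letter[code] == '*' ? 0 : letter[code];
}

// Pass 1: possible[x] is 1 if the first x symbols split into letters.
static void mark_possible(const char *str, size_t n, char *possible)
{
    possible[0] = 1;
    for (size_t x = 1; x <= n; x++) {
        possible[x] = 0;
        for (size_t k = 1; k <= 4 && k <= x; k++) {
            if (possible[x - k] && lookup(str + x - k, k)) {
                possible[x] = 1;
                break;
            }
        }
    }
}

// Pass 2: decode the first x symbols, walking backwards from the end.
// out holds the letters found so far in out[pos..]; returns the count.
static long extract(const char *str, size_t x, const char *possible,
                    char *out, size_t pos, int print)
{
    long count = 0;

    if (x == 0) {
        if (print) puts(out + pos);
        return 1;
    }
    for (size_t k = 1; k <= 4 && k <= x; k++) {
        char ch = lookup(str + x - k, k);

        // only follow splits whose remaining prefix is decodable
        if (ch != 0 && possible[x - k]) {
            out[pos - 1] = ch;
            count += extract(str, x - k, possible, out, pos - 1, print);
        }
    }
    return count;
}

long decode_two_pass(const char *str, int print)
{
    size_t n = strlen(str);
    char *possible = malloc(n + 1);
    char *out = malloc(n + 1);
    long count = 0;

    if (possible != NULL && out != NULL) {
        mark_possible(str, n, possible);
        out[n] = '\0';
        if (possible[n]) count = extract(str, n, possible, out, n, print);
    }
    free(possible);
    free(out);
    return count;
}
```

With single letters nearly every position is a possible letter end, so the first pass prunes very little here. It matters more once the lookup is restricted, for instance to a word list.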

Note that the longer the string, the more likely it is to have many possible interpretations.

If, in addition, you restrict the possible words to a predefined set, you can modify the first pass to use the words instead of individual letters. Then the number of possible interpretations should decrease a lot, and the algorithm will be fast even on longer strings.
