How can I analyze or improve my niece's simple compression algorithm that is based on Morse code?

Question

My 8 year old niece had a lesson on Morse code in school yesterday, and her assignment was to convert various phrases to Morse code. One of the phrases included her age, and instead of writing ---.., she wrote 3-2. because (in her words), "it's less writing that way." This rudimentary "compression algorithm" sparked my curiosity, so I wrote a bit of code to implement it.

However, we made a few changes along the way. I pointed out to her that if you wrote just .....-----, there isn't any way to tell if the writer meant 50 or eeeeettttt. In reality, there is a pause in between each letter of each word and every word so this isn't a problem, but our scheme didn't have that. I pulled out some graph paper and suggested padding the Morse code for each symbol with another symbol to facilitate coding and remove ambiguity from the scheme. My nice suggested using + because "no one ever writes those in sentences." (Ouch, I recently graduated with a math degree, but fair enough.)

Since some of us do write with +, and we all use hyphens and periods/dots, which would conflict with our standard definition of Morse code, these symbols are replaced with p, h, and d, respectively. Of course, this brings us to the problem of what to do with symbols that aren't defined in our extended Morse code. My niece wanted to simply ignore them, so that's what we did. For the sake of case sensitive preservation of the textual message, upper case letters aren't lower cased in the code; they're just carried through as is and padded with +.

Summary of the algorithm:

Morse codes are right-padded to 5 characters with +
We extended Morse code to substitute p for +, d for ., and h for -.
Symbols that aren't defined in our "extended" Morse code are passed through intact.
Runs of symbols are replaced with unless only one consecutive character occurs, in which case the number is omitted.

Potential pitfalls:

My padding scheme probably reduces the effectiveness of the compression.
Compression might be improved by using blocks larger than 5 characters
If either my niece or I knew anything about compression algorithms, we could probably use that makes them successful.
This obviously isn't suitable for production, but since there are numerous efficient compression algorithms for such purposes, I'm ignoring that problem for the time being.
???

Example:

In our algorithm, "Hello, World" translates to

H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++

and compresses to

H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+

Here is the Python code I threw together:

#!/usr/bin/python3

import itertools
import string

class MorseString(str):
    def __init__(self, string):
        # or, pad values during iteration but this seems neater
        self._char_morse_map = {"a":".-+++", "b":"-...+", "c":"-.-.+", "d":"-..++", \
                                "e":".++++", "f":"..-.+", "g":"--.++", "h":"....+", \
                                "i":"..+++", "j":".---+", "k":"-.-++", "l":".-..+", \
                                "m":"--+++", "n":"-.+++", "o":"---++", "p":".--.+", \
                                "q":"--.-+", "r":".-.++", "s":"...++", "t":"-++++", \
                                "u":"..-++", "v":"...-+", "w":".--++", "x":"-..-+", \
                                "y":"-.--+", "z":"--..+", "1":".----", "2":"..---", \
                                "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                                "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                                " ":"+++++", ".":"d++++", "+":"p++++", "-":"h++++"}
        
        self._morse_char_map = dict()
        for letter, code in self._char_morse_map.items():
            self._morse_char_map[code] = letter

        self._string = string

        # convert the string to "Morse code". Could also use c.lower()
        self._morse_string = "".join([self._char_morse_map.get(c, c.ljust(5, "+")) for c in self._string])

    def compress(self):
        def grouper(n, k):
            return str(n) + k if n > 1 else k

        # could just use lambda
        return "".join([grouper(len(list(g)), k) for k, g in itertools.groupby(self._morse_string)])

    def decompress(self):
        i = 0
        start = 0
        chars = list()
        sc = self.compress()
        while i < len(sc):
            curr = sc[i]
            i += 1
            if not(curr in string.digits):
                num = 1 if start + 1 == i else int(sc[start:i-1])
                chars.append("".join(curr * num))
                start = i

        code = "".join(chars)
        chars = list()

        for i in range(0, len(code), 5):
            piece = "".join(code[i:i+5])
            chars.append(self._morse_char_map.get(piece, piece[0]))
            
        return "".join(chars)


def main():
    s0 = "Hello, World"
    ms0 = MorseString(s0)
    print(ms0._morse_string)
    print(ms0.compress())
    assert(s0 == ms0.decompress())
    
    s1 = "Hello  2 world."
    ms1 = MorseString(s1)
    assert(s1 == ms1.decompress())

    s2 = "The quick brown fox jumped over the lazy dog."
    ms2 = MorseString(s2)
    assert(s2 == ms2.decompress())

    s3 = "abc  -.+"
    ms3 = MorseString(s3)
    ms3
    assert(s3 == ms3.decompress())

if __name__ == "__main__":
    main()

What are some simple methods that would a) improve our algorithm, and b) be relatively easy to explain to my 8-year old niece? While the last point is clearly subjective, I'm nevertheless trying to indulge her curiosity as much as possible.

I welcome any improvements to the code as well since it's not structured terribly well (I'm fairly certain it's structured quite poorly, actually, but it's quick and dirty), although that's strictly for my benefit since I haven't gotten my niece to use Python (YET).

Update

Here is an updated version of the code, which attempts to incorporate both user1884905's modifications to the algorithm and Karl's improvements to the code itself.

import itertools
import string

_char_morse_map = {"a":".-", "b":"-...", "c":"-.-.", "d":"-..", \
                   "e":".", "f":"..-.", "g":"--.", "h":"....", \
                   "i":"..", "j":".---", "k":"-.-", "l":".-..", \
                   "m":"--", "n":"-.", "o":"---", "p":".--.", \
                   "q":"--.-", "r":".-.", "s":"...", "t":"-", \
                   "u":"..-", "v":"...-", "w":".--", "x":"-..-", \
                   "y":"-.--", "z":"--..", "1":".----", "2":"..---", \
                   "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                   "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                   " ":"", ".":"d", "+":"p", "-":"h"}

_morse_char_map = {
    code: letter
    for letter, code in _char_morse_map.items()
}


def encode(s):
    return "".join(_char_morse_map.get(c, c) + "+" for c in s)

def decode(encoded):
    return "".join(decode_gen(encoded))

def decode_gen(encoded):
    word = ""
    for c in encoded:
        if c != "+":
            word += c
        else:
            yield _morse_char_map.get(word, word) if word != "" else " "
            word = ""

def compress(s):
    def grouper(n, k):
        return str(n) + k if n > 1 else k

    return "".join(grouper(len(list(g)), k) for k, g in itertools.groupby(s))

def decompress(compressed):
    return "".join(decompress_gen(compressed))

def decompress_gen(compressed):
    digits = ""
    for c in compressed:
        if c in string.digits:
            digits += c
        else:
            number = int(digits) if digits else 1
            yield "".join(c * number)
            digits = ""

Excellent, teach her Huffman encoding next, she'll gobble it up. — temporary_user_name, Mar 28 '13 at 03:06
you might want to post this on [CodeReview.SE](http://codereview.stackexchange.com/) instead... — MattDMo, Mar 28 '13 at 03:06
By the way, what your niece came up with is essentially a running-length encoding (RLE). — , Mar 28 '13 at 03:18
If you already taught this much to an 8-year-old you're doing an amazing job!!! — Karl Knechtel, Mar 28 '13 at 04:39
Good job teaching your niece this stuff. When my cousin gets a kid I'm planning of teaching him/her this kind of stuff as well. The hardest part is making it fun. Could you maybe share with us how you get her to find it fun? — WereWolfBoy, Mar 29 '13 at 14:05
@WereWolfBoy Right now, I think my niece is curious enough that she finds just going back and forth talking about it fun, but I haven't really showed her much code at the moment, so I'm not sure how I'll keep up the fun. She likes writing "secret messages" (what little kid doesn't?), so I might try that. I don't know much about crypto but maybe I'll work that in a bit too. — John Bensin, Mar 29 '13 at 19:16
@WereWolfBoy Any ideas, though? I'm a 22-year-old fresh graduate who's childless, so my knowledge of parenting and making learning fun for children is pretty close to zero. — John Bensin, Mar 29 '13 at 19:17
@johnBensin I am a 20 year old student who's childless. So my knowledge of parenting is also close to zero. However I would think that making it a game would be the best way to do it. I have the tendency to get mad at people when they aren't smart or fast enough with the computer (like I say press that button and the other person is like, nope I'm gonna click the other one) that is something you should absolutely avoid or else she will probably never want to do something like that ever again. :P — WereWolfBoy, Apr 02 '13 at 09:56

user1884905 · Answer 1 · 2013-03-29T13:51:26.930

4

A simple change that can be made to compress your output (at least in most cases) is to keep the idea of a pause between two letters and to let your +-signs denote this pause (i.e. the pluses are used as a new-character-character instead of using them as padding).

This will make all numbers 0-9 one character longer but make quite a few commonly occurring letters shorter.

For example a, e, i and t will become .-+, .+, ..+ and -+ instead of .-+++, .++++, ..+++ and -++++. (And space can be denoted + instead of +++++)

So your Hello, World example would become:

H+.+.-..+.-..+---+,++W+---+.-.+.-..+-..+

instead of

H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++

and compress to

H+.+.-2.+.-2.+3-+,2+W+3-+.-.+.-2.+-2.+

instead of

H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+

Understanding this reasoning, Huffman encoding will only seem like the natural next step, where the basic idea is to make the most common characters take up as little amount of space as possible.

Edit:

See also this wikipedia-image of a dichotomatic search table showing the most commonly occurring characters close to the top of the tree.

Morse code tree

edited Mar 29 '13 at 13:51

answered Mar 28 '13 at 08:58

user1884905

3,137
1
22
22

Good points all around, but in your example, you compress `+---+,++W+` to `+3-2+,2+W+`, but isn't the first 2 incorrect? Also, the original message, and the one that you use for your example, is `Hello, World`; the comma is missing from the actual message (and I figure that's important just so future visitors don't get mixed up). – John Bensin Mar 28 '13 at 20:22
`+` denotes the end of letter and the space too. Won't it be better to use say `++`to denote spaces. I can't figure a problem with using a `+` but it doesn't _seem_ right. Also maybe capitalization can be handled by using another spcial character (something like `^`), leading or trailing, to denote this letter is capitalized – RedBaron Mar 29 '13 at 14:16
2

@RedBaron Since all letters end with a `+`, the space character could be viewed as "`nothing followed by a +`". So typically in a text, a space would look like two pluses in a row, just as you say. As far as I can see, this will not lead to any ambiguity. I like your capitalization-idea. – user1884905 Mar 30 '13 at 09:56
@user1884905 I updated my algorithm to include your suggestion to use a single `+` as the delimiter, and that does provide a nice little discount to the compressed size. – John Bensin Apr 04 '13 at 19:06

score 2 · Accepted Answer · answered Mar 28 '13 at 05:09

I would not make a class for this; as is, you re-create the mappings for each string. That could still be avoided with a class, but really it's simpler to just set up those mappings and then write a plain function to encode the string. A "morse code string" is not really fundamentally different from a plain string, and attaching the compression and decompression functions to it doesn't really make any sense. Just write a bunch of functions; worry about OOP when you have a really meaningful abstraction.

As written, your decompression function makes no sense; compressing is not a part of decompressing, and you've combined decompression of the morse code with decoding back to a plain string in the same function. This is a mess.

self._morse_char_map = dict()
for letter, code in self._char_morse_map.items():
    self._morse_char_map[code] = letter

This is more neatly written with a dict comprehension:

self._morse_char_map = {
    code: letter
    for letter, code in self._char_morse_map.items()
}

"".join([...])

The square brackets are unnecessary here; just feed a generator expression to join instead, and take advantage of the special syntax rule.

chars = list()

This is more neatly written chars = [], but let's try to improve this at a higher level...

while i < len(sc):
    curr = sc[i]
    i += 1
    if not(curr in string.digits):
        num = 1 if start + 1 == i else int(sc[start:i-1])
        chars.append("".join(curr * num))
        start = i

A technique: instead of setting an empty list and repeatedly accumulating things to ''.join together, write a generator and let the result be passed instead. This becomes easier when you have the logic properly separated into its own function:

def decompress(compressed):
    return ''.join(decompress_gen(compressed))


def decompress_gen(compressed):
    start = 0
    i = 0
    while i < len(compressed):
        curr = compressed[i]
        i += 1
        if not(curr in string.digits):
            num = 1 if start + 1 == i else int(compressed[start:i-1])
            yield "".join(curr * num)
            start = i

Now, clearly we really want to just iterate over the compressed symbols with a for loop; manually incrementing an index looks really awful. To make this work, we need to look at the data a character at a time, and remember any parts of a number that we've already seen. We could do the arithmetic as we go, but let's preserve the use of int by instead building up a buffer of characters that are part of the count:

def decompress_gen(compressed):
    number_digits = ''
    for char in compressed:
        if char in string.digits:
            number_digits += char
        else:
            number = int(number_digits) if number_digits else 1
            yield "".join(char * number)
            number_digits = ''

chars = list()

for i in range(0, len(code), 5):
    piece = "".join(code[i:i+5])
    chars.append(self._morse_char_map.get(piece, piece[0]))

return "".join(chars)

At this point, code is a string, so the ''.join is not needed in creating piece. But again, we could just be using generators (well, generator expressions) here:

return ''.join(
    self._morse_char_map.get(piece, piece[0])
    for piece in (
        code[i: i + 5]
        for i in range(0, len(code), 5)
    )        
)

+1 for great points. I attempted to incorporate as much as possible into an updated version of the algorithm. — John Bensin, Apr 01 '13 at 18:59
@Karl, To make this a bit more complex, I am curious... (as you have provided such an insightful answer), would it be possible to calculate all of the deletion paths of removing a word (say 'hello') from the original word 'hello world' and return the total amount of paths found based on that removed word? Would you have the same approach? Use a function and store in a dictionary? — thesayhey, Feb 10 '16 at 04:23

How can I analyze or improve my niece's simple compression algorithm that is based on Morse code?

Summary of the algorithm:

Potential pitfalls:

Example:

Update

2 Answers2