Ruby Cyphering Leads to non Alphanumeric Characters

Question

I'm trying to make a basic cipher.

def caesar_crypto_encode(text, shift)  
  (text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr| 
  ((cstr.ord)+shift).chr }
end

but when the shift is too high I get these kinds of characters:

  Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")

  Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"

What is this format?

These are often better implemented with a simple mapping table and [`tr`](https://ruby-doc.org/core-2.4.1/String.html#method-i-tr) where there's no chance of introducing busted characters. The `chr` function makes no guarantees about output validity. — tadman, May 15 '17 at 04:28

Casper · Accepted Answer · 2017-05-15T14:06:11.623

You get the escaped output because Ruby is running with UTF-8 encoding, and your conversion has produced bytes that do not form a valid character sequence under UTF-8.

ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z by 97-122. When you add 127 you push every letter out of the 7-bit ASCII range and into 8-bit space (192-249), where a lone byte is not a valid UTF-8 character.

That's why Ruby's inspect prints the string in quoted form, showing each invalid byte as a hexadecimal escape ("\xC7...").
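One way to confirm this is to ask the string itself. The following is a minimal sketch, assuming a UTF-8 source file or irb session; the variable name `garbled` is made up for illustration:

```ruby
# The bytes from the question, written as a literal (assumes UTF-8 source encoding).
garbled = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"

puts garbled.encoding         # the string is labelled UTF-8 ...
puts garbled.valid_encoding?  # ... but its bytes are not a valid UTF-8 sequence
puts garbled.bytes.first      # 199 (0xC7), which is "H".ord + 127
```

`valid_encoding?` returning false is what makes inspect fall back to the "\x.." escapes.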

If you want to get some semblance of characters out of this, you could reinterpret the bytes as ISO8859-1, which supports 8-bit characters.

Here's what you get if you do that:

s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>

# Relabel the bytes as ISO8859-1 (force_encoding changes the label, not the bytes).
# Your terminal (and Ruby) is using UTF-8, so Ruby still shows escapes here.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"

# To print ISO8859-1 text on a UTF-8 terminal, you have to
# transcode it to UTF-8. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"

# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"

Of course, the proper way to do this is not to let the numbers overflow into 8-bit space under any circumstances. Your encryption algorithm has a bug, and you need to ensure that the output stays in the 7-bit ASCII range.

A better solution

Like @tadman suggested, you could use tr instead:

AZ_SEQUENCE = [*'A'..'Z', *'a'..'z']

"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL TLOIA!"
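For completeness, the tr approach could be wrapped in a method with the same signature and nil/blank handling as the one in the question. This is only a sketch: the names `ALPHABET` and `caesar_tr_encode` are made up here, and the 52-letter ordering (A-Z followed by a-z) is an assumption inferred from the expected output in the question.

```ruby
# Hypothetical names; the A-Z-then-a-z ordering is assumed from the question's test.
ALPHABET = [*'A'..'Z', *'a'..'z'].freeze

def caesar_tr_encode(text, shift)
  # Mirror the question's guard: nil or blank input yields an empty string.
  return "" if text.nil? || text.strip.empty?

  # rotate wraps around the array, so any shift (even negative) maps letters to letters.
  text.tr(ALPHABET.join, ALPHABET.rotate(shift).join)
end

puts caesar_tr_encode("Hello world!", 127)  # prints eBIIL TLOIA!
```

Because `rotate` wraps around, calling the method again with the negated shift should restore the original text.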

score 0 · Answer 2 · answered May 15 '17 at 02:52

I'm still curious about that format though...

Those escapes are the raw bytes you get after taking the ordinal (ord) of each letter, adding 127 to it, and converting the result back with chr (i.e. ((cstr.ord)+shift).chr). The results are above 127, so they are no longer ASCII.

Why? Check Integer#chr, from the docs:

Returns a string containing the character represented by the int's value according to encoding.

So, for example, take your first letter "H":

char_ord = "H".ord
#=> 72

new_char_ord = char_ord + 127
#=> 199

new_char_ord.chr
#=> "\xC7"

So, 199 corresponds to "\xC7". Keep changing all characters in "Hello world" and you will get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".

To avoid this, wrap the shifted value around so that it always lands on an ord value that represents a letter (see the answer in the Possible duplicate link).
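As a rough sketch of that wrap-around idea (not the code from the linked answer), the version below works with each letter's position in a 52-letter alphabet instead of its raw ord, and uses modulo to stay inside it. The names `LETTERS` and `caesar_wrap_encode` are hypothetical, and the A-Z-then-a-z ordering is an assumption based on the expected output in the question.

```ruby
# Hypothetical names; ordering (A-Z then a-z) assumed from the question's test.
LETTERS = [*'A'..'Z', *'a'..'z'].freeze

def caesar_wrap_encode(text, shift)
  return "" if text.nil? || text.strip.empty?

  text.gsub(/[a-zA-Z]/) do |char|
    # Modulo wraps the shifted position, so the result is always a letter.
    LETTERS[(LETTERS.index(char) + shift) % LETTERS.size]
  end
end

puts caesar_wrap_encode("Hello world!", 127)  # prints eBIIL TLOIA!
```

Ruby's `%` returns a non-negative result for a positive divisor, so negative shifts wrap correctly as well.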
