This might help you to understand what's happening:
    # encoding: UTF-8
    RUBY_VERSION # => "1.9.3"
    magic_string = "\uFEFFTime Period" # the leading BOM is normally invisible
    magic_string[0].chr # => "\uFEFF"
The output is the same under Ruby 2.2.2.
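If you need to remove a stray BOM before processing, a simple substitution works. This is just a sketch using core-Ruby string methods on the magic_string from above:

    clean = magic_string.sub(/\A\uFEFF/, "") # drop a leading BOM, if present
    clean        # => "Time Period"
    clean[0].chr # => "T"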
Older versions of Ruby didn't default to UTF-8 and treated strings as arrays of bytes. The encoding comment on the first line is important: it tells Ruby what encoding the script's string literals use.
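If you're unsure which encoding is in effect, Ruby can tell you directly. A quick sketch using only core methods; the exact value of Encoding.default_external depends on your locale:

    # encoding: UTF-8
    __ENCODING__              # => #<Encoding:UTF-8> (the script's source encoding)
    "Time Period".encoding    # => #<Encoding:UTF-8> (literals inherit it)
    Encoding.default_external # typically UTF-8 on modern systems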
Ruby now correctly treats strings as sequences of characters, not bytes, which is why it reports the first character as "\uFEFF", a multibyte character (three bytes when encoded as UTF-8).
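You can see the character-versus-byte distinction directly; a small sketch, assuming nothing beyond core Ruby:

    bom = "\uFEFF"
    bom.length   # => 1 (one character)
    bom.bytesize # => 3 (three bytes in UTF-8)
    bom.bytes    # => [239, 187, 191] (0xEF 0xBB 0xBF, the UTF-8-encoded BOM)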
"\uFEFF"
and "\uFFFE"
are BOM markers showing which "endian" the characters are. Endianness is tied to the CPU's idea of what a most significant and least significant byte is in a word (two bytes typically). This is also tied to Unicode, both of which are something you need to understand, at least in a rudimentary way as we don't deal with only ASCII any more, and languages don't consist of only the Latin character set.
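You can watch the endianness flip by encoding the BOM character both ways; again, just a core-Ruby sketch:

    bom = "\uFEFF"
    bom.encode("UTF-16BE").bytes # => [254, 255] (0xFE 0xFF: big-endian)
    bom.encode("UTF-16LE").bytes # => [255, 254] (0xFF 0xFE: little-endian)

A decoder that sees 0xFF 0xFE knows the stream is little-endian, because reading those bytes big-endian would yield U+FFFE, which Unicode deliberately defines as a noncharacter.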
UTF-8 is a multibyte encoding that covers a huge number of characters from many languages. You can also run into UTF-16LE, UTF-16BE, or even wider encodings such as UTF-32. HTML and other documents on the internet can be encoded with varying character widths depending on the originating hardware and software, and not being aware of that can drive you nuts and send you down the wrong paths trying to read their content. It's important to read the IO class documentation on "IO Encoding" to learn the right way to deal with these kinds of files.
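For example, Ruby's IO layer can consume a leading BOM for you. A minimal sketch, assuming a hypothetical file named "data.txt"; the "BOM|UTF-8" external encoding tells Ruby to strip a BOM, if one is present, when reading:

    File.write("data.txt", "\uFEFFTime Period") # create a sample file with a BOM

    text = File.read("data.txt", mode: "r:BOM|UTF-8")
    text          # => "Time Period" (BOM consumed by the IO layer)
    text.encoding # => #<Encoding:UTF-8>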