
In Ruby 1.9.3-p429, I am trying to parse plain text files in various encodings that will ultimately be converted to UTF-8 strings. Non-ASCII characters work fine when a file is encoded as UTF-8, but problems come up with non-UTF-8 files.

Simplified example:

File.open(file) do |io|
  # Transcode from the detected charset to UTF-8 at the IO level,
  # e.g. "ISO-8859-1:UTF-8".
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil

  # Read one character at a time until end of file or end of line.
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)

    line << char
  end
  line
end

Both files contain just the single string áÁð, encoded appropriately. I have verified that the files are encoded correctly via $ file -i <file_name>

With a UTF-8 file, I get back:

Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints

With an ISO-8859-1 file:

Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL

The way I interpret this is that readchar is returning an incorrectly transcoded string, which in turn causes slice to return something other than the full character.
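
One quick check of that interpretation (a sketch, using the same io as in the loop above) is to inspect the string readchar actually hands back:

char = io.readchar
puts char.encoding           # expected: UTF-8
puts char.bytes.to_a.inspect # expected: [195, 161] for á in UTF-8
puts char.valid_encoding?    # false would suggest mis-tagged bytes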

Is this behavior correct, or am I specifying the file's external encoding incorrectly? I would rather not rewrite this process, so I am hoping I am making a mistake somewhere. There are reasons I parse files this way, but I don't think they are relevant to the question. Specifying the internal and external encodings as options to File.open yielded the same results (see the sketch below).
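
For reference, this is the kind of call I mean (a sketch; charset is assumed to hold the detected encoding name, e.g. "ISO-8859-1"):

File.open(file, external_encoding: charset, internal_encoding: Encoding::UTF_8) do |io|
  # ... same readchar loop as above; the output is identical
end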


1 Answer


This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
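
Until it is fixed, a possible workaround (a sketch, not from the bug report; charset is assumed to hold the source encoding name) is to open the file with only an external encoding and transcode each character explicitly with String#encode:

File.open(file, external_encoding: charset) do |io|
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    # readchar returns a one-character string in the external encoding;
    # transcode it to UTF-8 by hand instead of relying on set_encoding.
    char = io.readchar.encode(Encoding::UTF_8)
    line << char
  end
  line
end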
