4

Parsing some webpages with nokogiri, i've got some issues while cleaning some Strings and saving them with YAML. To reproduce the problem look at this IRB session that reproduces the same problem:

irb(main):001:0> require 'yaml'
=> true
irb(main):002:0> "1,000 €".to_yaml
=> "--- !binary |\nMSwwMDAg4oKs\n\n"
irb(main):003:0> "1,0000 €".to_yaml
=> "--- \"1,0000 \\xE2\\x82\\xAC\"\n"
irb(main):004:0> "1,00 €".to_yaml
=> "--- !binary |\nMSwwMCDigqw=\n\n"
irb(main):005:0> "1 €".to_yaml
=> "--- !binary |\nMSDigqw=\n\n"
irb(main):006:0> "23 €".to_yaml
=> "--- !binary |\nMjMg4oKs\n\n"
irb(main):007:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
irb(main):008:0> "1200000 €".to_yaml
=> "--- \"1200000 \\xE2\\x82\\xAC\"\n"
irb(main):009:0> "120000 €".to_yaml
=> "--- \"120000 \\xE2\\x82\\xAC\"\n"
irb(main):010:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"

To sum up, sometimes .to_yaml outputs are readable while other times the output is unreadable. The most intriguing aspect is that the strings are very similar.

How can I avoid those !binary ... outputs?

marcel massana
  • 367
  • 3
  • 11

1 Answers1

1

Whether YAML prefers to dump a string as text or binary is a matter of ratio between ASCII and non ASCII characters.

If you want to avoid !binary as much as possible, you should use the ya2yaml gem. It tries hard to dump strings as ASCII + escaped UTF-8.

gioele
  • 9,748
  • 5
  • 55
  • 80