Using Ruby 1.8.7, I want to accept CSVs into my system. Even though this is an admin application, it seems I can get several different types of CSV. On my Mac, if I export from Excel using the "Windows CSV" option, FasterCSV can read the file by default. On Windows I seem to be getting UTF-16-encoded CSVs (which I haven't figured out how to parse yet).

It seems like a pretty common thing to let users upload a CSV that could be in UTF-8, UTF-16, ASCII, etc., then detect the encoding and parse it. Has anyone figured this out?

I started to look at UniversalDetector to help me detect the encoding, then use Iconv to convert, but this seems to be tricky and I was hoping someone had figured it out :)

the Tin Man
Joelio

1 Answer

According to FasterCSV's docs, the initialize method takes an :encoding option:

The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: 'n' or 'N' for none, 'e' or 'E' for EUC, 's' or 'S' for SJIS, and 'u' or 'U' for UTF-8 (see Regexp.new()).

Because that list is limited, you might want to use iconv to pre-process the contents, then pass the converted text to FasterCSV. You can use Ruby's interface to iconv ("Iconv") or the command-line version of it. Iconv is very powerful and flexible, and it can convert UTF-16 among other things.

Actually detecting the encoding of the document is more problematic, but the command-line version can help you there. If I remember right it can help identify the encoding. It can also convert between encodings, or, if you want, it can be told to convert to ASCII, either transliterating to the closest matching characters (the //TRANSLIT suffix) or dropping characters it cannot convert (the //IGNORE suffix).
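One cheap first check, independent of iconv, is to look for a byte-order mark, which UTF-16 files saved on Windows often carry, and then fall back to a short list of likely encodings. The sketch below is only a heuristic, and it assumes Ruby 1.9 or later, since it relies on String#force_encoding and String#valid_encoding?, which 1.8.7 lacks. The names guess_encoding, BOMS and CANDIDATES are hypothetical and not part of FasterCSV or Iconv.

```ruby
# Hypothetical helper for illustration; not part of FasterCSV or Iconv.
# Byte-order marks, built with pack so they are raw (binary) strings.
BOMS = [
  [[0xEF, 0xBB, 0xBF].pack("C*"), "UTF-8"],
  [[0xFF, 0xFE].pack("C*"),       "UTF-16LE"],
  [[0xFE, 0xFF].pack("C*"),       "UTF-16BE"]
]

# Fallback guesses, tried in order. This list is an assumption;
# adjust it to the encodings your users are likely to send.
CANDIDATES = ["UTF-8", "Windows-1252"]

def guess_encoding(bytes)
  # Work on a binary copy so the caller's string is left untouched.
  raw = bytes.dup.force_encoding("BINARY")

  # 1. A byte-order mark is the strongest hint available.
  BOMS.each do |bom, name|
    return name if raw.start_with?(bom)
  end

  # 2. Otherwise take the first candidate the bytes are valid in.
  CANDIDATES.each do |name|
    return name if raw.dup.force_encoding(name).valid_encoding?
  end

  nil
end
```

Calling guess_encoding(File.open(path, "rb") { |f| f.read }) returns an encoding name to hand to the conversion step. A single-byte encoding such as Windows-1252 tends to accept almost any byte string, so the last candidate works as a catch-all rather than a positive identification.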

Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues involved in handling character sets and multibyte characters, you should read James Gray's blog posts on the subject.
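As a rough illustration of what the upgrade buys you, here is a minimal sketch that transcodes uploaded bytes to UTF-8 with String#encode and parses them with the csv library that ships with Ruby 1.9 and later (FasterCSV became the standard csv library in 1.9). It assumes you already know, or have guessed, the source encoding. The helper name parse_csv_bytes is hypothetical.

```ruby
require "csv"

# Hypothetical helper for illustration. source_encoding is whatever the
# application decided or detected, for example "UTF-16LE".
def parse_csv_bytes(bytes, source_encoding)
  # Tag a copy of the raw bytes with the encoding they are believed to be in.
  text = bytes.dup.force_encoding(source_encoding)

  # Transcode to UTF-8, substituting "?" for anything that cannot be
  # converted instead of raising an exception.
  utf8 = text.encode("UTF-8", :invalid => :replace,
                              :undef   => :replace,
                              :replace => "?")

  # Drop a leading byte-order mark so it does not end up in the first field.
  utf8 = utf8.sub(/\A\uFEFF/, "")

  CSV.parse(utf8)
end
```

For example, parse_csv_bytes(File.open("upload.csv", "rb") { |f| f.read }, "UTF-16LE") returns an array of rows whose fields are UTF-8 strings; the file name here is made up. Note that when the source is already labelled UTF-8 the encode call is effectively a no-op, so invalid bytes in a supposedly UTF-8 file are not cleaned up by this sketch.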

the Tin Man
  • Hi, thanks, but I think you are mostly restating my question with more details. My question is whether there is any framework, plugin, gem, wrapper, etc. that deals with this. It seems like it's pretty common to want to build an application that allows users to attach CSVs of various formats... – Joelio Mar 02 '11 at 20:53
  • I am not aware of any that can do it, at least there weren't a couple years ago when I was looking at the same sort of problem. iconv has the ability to tell you what it thinks the encoding is. You can tell it to check the most likely suspects and go with whichever scores the highest percentage, but it's still error prone and easily fooled if you don't already know. CSV is a primitive format and not well conforming to a spec, so usually the app owner specifies what encodings are supported, rather than acting like a catch all for whatever someone throws at it based on their whim of the day. – the Tin Man Mar 03 '11 at 00:21