
I have a CSV file with lines like this in it:

...,"Städtische Galerie im Lenbachhaus",...

I am using Ruby 1.8, with the FasterCSV gem, like so:

FasterCSV.foreach(file, :encoding => 'u', :headers => :first_row) do |r|
  as = ImportObject.create!(r.to_hash)
end

For most rows it's working fine, but for rows like the one above, the field with the special character gets truncated, so "St" is what ends up saved in the DB.

I have set $KCODE = "u" and tried both with and without the :encoding option, to no avail.

The DB is MySQL.

EDIT:

I tried pushing the code up to Heroku (Postgres) and am now getting a new error:

2011-02-19T17:19:01-08:00 app[web.1]: ActiveRecord::StatementInvalid (PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xe46474

2011-02-19T17:19:01-08:00 app[web.1]: HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

2011-02-19T17:19:01-08:00 app[web.1]: : INSERT INTO "import_objects" (... "title", ...) VALUES (..., 'St?dtische Galerie im Lenbachhaus', ...) RETURNING "id"):

:(

Chris Kimpton
    What would you get if you just read the file and printed the name straight from Ruby? That is, is it Ruby that's choking on the encoding, or does it only get truncated once it's been through the DB? – Xavier Holt Feb 19 '11 at 23:39
  • Hi @Xavier, I tried that but am not sure of the results... It showed more than just "St", but did not show the special characters, i.e. it was "St?dtische Galerie Lenbachhaus" - but it then gets truncated when it's put in the DB. Not sure if the issue is due to misreading the file or to the DB not being configured correctly to write it. – Chris Kimpton Feb 20 '11 at 01:24
  • Since FasterCSV can presumably read UTF-8 just fine, I'm taking the fact that it didn't understand your ä as an indicator that your file isn't actually UTF-8 encoded, but _something_ else... I'm a bit out of my league at this point, but wuputah has a (second) answer that might be helpful. Good luck! – Xavier Holt Feb 20 '11 at 04:45

3 Answers


The problem is likely a file encoding issue, as you have surmised. The most likely scenario is that your file is not actually encoded in UTF-8, so the rest of your application cannot make sense of the foreign encoding. It's also possible -- but I believe quite unlikely -- that one of the bytes used in the encoding matches a quote or comma in ASCII, which would break FasterCSV's parsing of the data.

First, make a test file with just the "problem row" in your CSV file. Next, read the data in the file:

text_in = File.read('data.csv')

Now you have to convert it. The problem is, you don't really know what it is. You'll have to try a few different things. My best guess is the text is Latin-1 encoded.
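Before converting, a quick sanity check (a sketch of my own, not part of the original workflow) is to dump the raw bytes around the problem character. Latin-1 encodes ä as the single byte 0xE4 -- which matches the 0xe4 leading the 0xe46474 sequence in the Postgres error above -- whereas UTF-8 would use the two-byte sequence 0xC3 0xA4:

# Print every byte of the data in hex; on Ruby 1.8 strings are
# byte arrays, so unpack('C*') yields one integer per byte.
puts text_in.unpack('C*').map { |b| sprintf('%02x', b) }.join(' ')

If you see e4 where the ä should be, Latin-1 (or its close cousin Windows-1252) is a safe bet.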

require 'iconv'
text_out = Iconv.conv("UTF8", "LATIN1", text_in)

Now try to import this data. Alternatively, you can write to disk and open it, and see if it's encoded properly.
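For example, writing it back out might look like this (the output filename is just a placeholder):

# Write the converted text to a new file so you can inspect it.
File.open('data_conv.csv', 'w') { |f| f.write(text_out) }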

But honestly, you can do this outside of Ruby much more easily.

$ iconv -t UTF8 -f LATIN1 < data.csv > data_conv.csv


wuputah
  • PERFECT - the iconv command line did the trick - the file now loads into the DB with umlauts and all. FWIW, on my MBA, it's UTF-8, not UTF. – Chris Kimpton Feb 20 '11 at 09:06

The problem is not FasterCSV: in my testing, FasterCSV has no problem reading this data. For instance:

>> FasterCSV.parse("a,Städtische Galerie im Lenbachhaus,b,ä", :headers => [:a,:b,:c,:d]) do |r|
|    r = r.to_hash
|    p r
|    puts r[:d]
|  end  
{:c=>"b", :a=>"a", :d=>"\303\244", :b=>"Städtische Galerie im Lenbachhaus"}
ä

Note that Ruby 1.8 doesn't handle Unicode characters properly, but principally this affects things like String#length, which counts bytes rather than characters. For instance, Ruby will return the length of this string as 34 instead of 33. However, this doesn't have an effect until you do something with the string, like run a validation on it.

>> "Städtische Galerie im Lenbachhaus".length
=> 34
>> "Stadtische Galerie im Lenbachhaus".length
=> 33
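
As an aside (my addition, not from the original answer), Ruby 1.8's bundled jcode library can count characters rather than bytes once $KCODE is set:

require 'jcode'
$KCODE = 'u'
# jlength counts multibyte characters; plain String#length
# counts bytes on Ruby 1.8.
puts "Städtische Galerie im Lenbachhaus".jlength  #=> 33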

So my guess is it's something about ImportObject or how your database connection is configured.
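
If you want to rule out the connection, you can ask MySQL which character sets it thinks are in play (a sketch of mine; assumes the MySQL adapter and Rails 2.x, where select_all returns an array of hashes):

# List the client/connection/database character sets MySQL is using.
ActiveRecord::Base.connection.select_all(
  "SHOW VARIABLES LIKE 'character_set%'"
).each { |row| puts "#{row['Variable_name']} = #{row['Value']}" }

In particular, character_set_client and character_set_connection should read utf8 if the data is meant to round-trip as UTF-8.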


Ruby version used in these tests:

>> RUBY_DESCRIPTION 
=> "ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02"
wuputah
  • Hi wuputah - I am using "ruby 1.8.7 (2010-12-23 patchlevel 330) [i686-darwin10.5.0]". ImportObject is an ActiveRecord::Base object, no validations - but I do use attr_accessible. I have tried amending the AR object through the Rails console and that seemed to work with the special chars - so I thought that side was ok... – Chris Kimpton Feb 20 '11 at 01:29
  • Fair enough. Based on your comments, it sounds like you do have a file encoding issue - it's not FasterCSV itself per se, but the reading of the file into Ruby... will post a new answer. – wuputah Feb 20 '11 at 01:35

You don't say what database type you're using, but it's very possible the DB is not configured for UTF-8 and instead expects ASCII. Throwing non-ASCII characters at it could result in a truncated string, a missing character, or a character replaced with a placeholder, depending on the database and on which gem or ORM you're using to talk to it. When I build a database I make sure it's configured for UTF-8, or I make sure the text I push into it is encoded so it can make a round trip without corruption or loss. I learned that lesson the same way you are: the hard way.
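
For a Rails app on MySQL, the usual fix on the connection side is setting encoding: utf8 in config/database.yml, which makes the classic mysql adapter issue SET NAMES utf8 when it connects. The manual equivalent, as a sketch of mine against an existing connection:

# Force the current connection to exchange UTF-8 with the server.
# Normally you'd set encoding: utf8 in database.yml instead.
ActiveRecord::Base.connection.execute("SET NAMES 'utf8'")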

Check the database's log, and/or check your code to see whether you can enable logging of errors and warnings for the database inserts.
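
On MySQL in its default non-strict mode, an invalid byte for the column's character set doesn't raise an error: the value is silently truncated at the bad byte and a warning (code 1366, "Incorrect string value") is recorded instead, which would explain the "St" you're seeing. Here is a sketch of how to surface those warnings right after an insert (assumes the MySQL adapter; the :title attribute comes from the INSERT shown in the question):

# Run the problem insert, then immediately ask MySQL for warnings.
ImportObject.create!(:title => 'Städtische Galerie im Lenbachhaus')
ActiveRecord::Base.connection.select_all('SHOW WARNINGS').each do |w|
  puts "#{w['Level']} #{w['Code']}: #{w['Message']}"
end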

It's easy to disable warnings and errors with a lot of databases, but during development you don't want to do that. Those messages are important and can signal big problems to come. Ignoring them and pushing code to production can be a real recipe for sleepless nights.

the Tin Man
  • Hi - using MySQL - I can enter these special characters through the front end and it's taking them and re-presenting them - so I think that's ok... thanks. – Chris Kimpton Feb 20 '11 at 01:25