
I recently went through a round of testing with various inputs on my Rails application, and I've discovered a problem with how null characters in incoming requests are handled. My application is backed by a PostgreSQL 8.4 database. It turns out that PostgreSQL doesn't support storing null characters in the standard `text` or `varchar` types. However, when I try to store a string containing a null character through Rails, the default behavior appears to be truncating the string after the null character. I view this as a problem because the string is validated before the truncation occurs: validations for things like length can be bypassed by inserting null characters, since Ruby handles them perfectly fine and the truncation only happens on insert.
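For reference, plain Ruby handles embedded null bytes without complaint, which is why length validations pass before the truncating insert:

```ruby
s = "abc\u0000def"

# The null character counts as an ordinary character in Ruby,
# so a length validation sees 7 characters.
s.length              # 7

# U+0000 is a legal UTF-8 code point, so encoding checks pass too.
s.valid_encoding?     # true

# But the encoded string contains a zero byte, which PostgreSQL's
# text/varchar types will not store.
s.bytes.include?(0)   # true
```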

I'm trying to figure out the best way of dealing with these inputs. Ideally, an exception would be raised somewhere, the way one is for invalid UTF-8 byte sequences. Right now, the only option I can think of is explicitly checking every input string for null characters. I would much rather have a generalized approach, but I'm not sure where to start looking. Would patching the PostgreSQL adapter to check for this be an option? Or is there some standard approach that I've been missing?
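The explicit per-string check could at least be factored into one reusable predicate; the sketch below shows the core check, with a hypothetical Rails validator in comments (the validator name and wiring are illustrative, not a built-in API):

```ruby
# Core check: does a value contain an embedded null character?
def contains_null_byte?(value)
  value.is_a?(String) && value.include?("\u0000")
end

# Hypothetical Rails usage built on the same check
# (names are mine, not part of Rails):
#
#   class NoNullBytesValidator < ActiveModel::EachValidator
#     def validate_each(record, attribute, value)
#       if value.is_a?(String) && value.include?("\u0000")
#         record.errors.add(attribute, "must not contain null bytes")
#       end
#     end
#   end
#
#   validates :title, no_null_bytes: true
```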

  • Who is truncating the string? PostgreSQL? Rails? Rack? PostgreSQL, thankfully, tends to complain loudly if you hand it invalid data; I'm guessing that the zero byte never even makes it to your model. – mu is too short Jul 18 '12 at 22:48
  • From my testing, it appears that the Rails->PostgreSQL adapter (pg gem) is doing the truncation, since I see the null byte in my Rails model, and I don't get the errors I was expecting from PostgreSQL. – Belly Jul 30 '12 at 22:36
  • Have you considered using a binary data type (i.e. `bytea`) for storing binary data? – mu is too short Jul 30 '12 at 23:03
  • I considered that, but it makes things much more difficult overall since I have to do all the encoding/decoding by hand. Fundamentally, I -do- want to be storing text. I'm just having trouble reconciling rails and postgresql definitions of UTF-8. – Belly Aug 03 '12 at 18:38
  • No one's definition of UTF-8 allows zero bytes. There are no zero bytes in UTF-8. Ever. If you have zero bytes then you have binary data that contains embedded UTF-8 text. – mu is too short Aug 03 '12 at 18:45
  • Are you sure about that? Looking through the [wikipedia article](http://en.wikipedia.org/wiki/UTF-8), it appears that the UTF-8 encoding for the NUL codepoint U+0000 would be the null byte. Unless it was encoded as modified UTF-8. Ruby's UTF-8 encoding does not appear to be the modified version: `"\u0000".bytes.to_a # => [0]` – Belly Aug 03 '12 at 22:42
  • But that doesn't go inside text. – mu is too short Aug 03 '12 at 22:47
  • Yes, that's my problem. 'text' cannot store \x00, but Rails encodes U+0000 as \x00. I am looking for a general approach for Rails to either mark U+0000 as invalid for all input, escape/unescape automatically, or throw an exception when attempting to insert it into postgresql. – Belly Aug 06 '12 at 18:09
  • No, Rails doesn't encode it as `\x00`, that's just how Ruby shows you the unprintable zero byte. Switch to a binary column type or add validations that look for zero bytes. – mu is too short Aug 06 '12 at 18:21
  • Sorry, yes, I was using \x00 to represent the 0 byte for our purposes. I was looking for a way to avoid having to do the encoding by hand and losing the postgresql string manipulation functions or adding validations to every string input by hand. But it appears that there's no way around that, which is unfortunate. Thanks for the input. Could you post this as an answer so that I can accept it? – Belly Aug 06 '12 at 21:06
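The by-hand stripping discussed in the comments above can be as small as a `String#delete` call before the value reaches the adapter, at the cost of silently altering the input rather than rejecting it:

```ruby
raw = "abc\u0000def"

# Remove every embedded zero byte; the remaining text is safe
# for PostgreSQL text/varchar columns.
clean = raw.delete("\u0000")
clean  # => "abcdef"
```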

0 Answers