0

Before I save a byte array to the database if I print the output of new String(data) it returns a readable string like "foobar" but after I pull it out of the database, new String(data) will read like a bunch of gibberish like "9238929384739427349327...". There's so many parts here I'll just try to list them all. I'm using eclipselink and my data column is defined:

@Lob
@Column(name = "data")
private byte[] data;

If I run this code:

public static void main(String[] args) {
    System.out.println(Charset.defaultCharset());
}

It outputs windows-1250.

My database is defined as:

CREATE DATABASE project_trunk
  WITH OWNER = project
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'English_United States.1252'
       LC_CTYPE = 'English_United States.1252'
       CONNECTION LIMIT = -1;

I've also tried this on a DB defined like this:

CREATE DATABASE project_trunk
  WITH OWNER = project
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_US.UTF-8'
       LC_CTYPE = 'en_US.UTF-8'
       CONNECTION LIMIT = -1;

And the problem still occurs.

I think what's happening is my database has a different encoding from my appserver. When I put things into the database and pull it out again, it decodes it the wrong way so it looks like gibberish. Am I on to something there?

Now when it comes to a solution to this problem, I'm a little confused. I think what I should be doing is changing my appserver's file encoding to be the same as the database. I'm using Glassfish 2.1.1. When I go to application server -> advanced -> domain attributes and set the Locale to "UTF8" or "UTF-8" it tells me a restart is required. After I restart glassfish, that field is still blank and I still get the error. I think maybe it isn't saving the property. I'd manually put it in the configuration file, but I don't know where or what to put.

Alternatively, I tried creating my database with an ENCODING = 'WIN1250' but when I do that it says my LC_CTYPE needs to be "WIN1252". When I set LC_CTYPE to "WIN1252" it says that encoding doesn't exist.


I'm spending a lot of time on this, I'd like to know if I'm on to something here. Does my theory of "out of sync encodings between appserver and db" sound correct, or am I chasing a red herring? If anyone could help me figure out how to change this setting for glassfish 2.1.1 that'd also be very helpful. Thanks

EDIT: People are asking why I'm storing Strings as raw bytes. That's not exactly what I'm doing, sometimes the raw bytes represent an image or a pdf or a binary, sometimes it's text. My test is inserting a plain text String and pulling it back out to make sure it got saved correctly. This test passes on our CI server which is on linux.

EDIT2: I was asked to show the raw binary input vs raw binary output.

Expected :[116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 32, 97, 115, 32, 97, 32, 83, 116, 114, 105, 110, 103]

Actual :[60, 54, 56, 54, 57, 55, 51, 50, 48, 54, 57, 55, 51, 50, 48, 54, 100, 55, 57, 50, 48, 54, 49, 55, 52, 55, 52, 54, 49, 54, 51, 54, 56, 54, 100, 54, 53, 54, 101, 55, 52, 50, 48, 54, 49, 55, 51, 50, 48, 54, 49, 50, 48, 53, 51, 55, 52, 55, 50, 54, 57, 54, 101, 54, 55]

I gave this same test that checks the bytes to my coworker who's on a mac, and it passes for him.

Community
  • 1
  • 1
Daniel Kaplan
  • 62,768
  • 50
  • 234
  • 356

2 Answers2

2

sometimes the raw bytes represent an image or a pdf or a binary, sometimes it's text

Okay, then you shouldn't be storing them as text.

Regardless of what's currently going wrong, even if you can get this to work for data which is actually text, you've got problems coming later on.

If you must store arbitrary binary data as text, you should use base64 to encode it - that way you can get back to the original binary with no problems. (You've only got to be able to transport ASCII strings around, and that's usually reasonably easy.) There are lots of third-party libraries for Base64; I like this self-contained public domain one.

Alternatively, store the data as binary data in the database, e.g. using a field of the bytea data type. That way you shouldn't need to do any conversion work: you should just be able to put it into the database as a byte array, and get it out as a byte array.

EDIT: Okay, it looks like you're getting back the hex representation of the binary data, but in ASCII. That's distinctly odd.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • I'm sorry, I'm not communicating very well. I'm not storing the binary as text, I'm storing the binary as binary. But to test that the binary I put in is the same as the binary I pull out, I'm using `new String(...)` on the binary data. When that string happens to be text, it's obvious that it came out differently. In the DB, it is a bytea, and in JPA it's an @Lob – Daniel Kaplan Feb 12 '13 at 19:39
  • 1
    @tieTYT: Don't convert it to a string at all. That's just clouding the issue (very significantly). Why don't you post the *exact bytes* before and after? `Arrays.toString(byte[])` should be fine to get diagnostic data. When you're *only* dealing with binary data, everything you've posted about locales and character sets should be *completely* irrelevant. – Jon Skeet Feb 12 '13 at 19:42
  • OK, I edited my question with that information. They look kind of far off. – Daniel Kaplan Feb 12 '13 at 21:19
  • In addition to that, I just gave this same test to my coworker who's using a Mac and the test passes – Daniel Kaplan Feb 12 '13 at 21:24
  • @tieTYT: It looks like you're getting hex back (each byte of input is returned as two bytes of output, which decoded as ASCII would be the hex representation). Can you reproduce this with a short but complete program? Is your coworker with a Mac using the same code and the same versions of all libraries? – Jon Skeet Feb 12 '13 at 21:26
  • re: my coworker, I think it's pretty safe to say yes. We may be a few days out of sync w/ regards to source code, but I've been having this issue for months and we all use maven so we should be getting the same libs. My short but complete program would need to access the db to work because that's how I reproduce this. Would that be ok? – Daniel Kaplan Feb 12 '13 at 21:29
  • @tieTYT: Yes, that's fine. It would be nice to try to reproduce it, basically. Have you worked out whether the problem is how your code writes, or how your code reads? What happens if your colleague *reads* the data you've written? – Jon Skeet Feb 12 '13 at 21:38
  • I made a test case that passes for me. http://pastebin.com/Ar6MfgUb This is connecting to the same database with a test table. I think this suggests there's some code in the middle of our app that's modifying the input. – Daniel Kaplan Feb 12 '13 at 21:52
  • @tieTYT: Right. That wouldn't surprise me at all. Unfortunately that means it's going to be hard for us to help you further :( – Jon Skeet Feb 12 '13 at 22:01
  • Jon: That's ok, writing that stand alone test case was a worthwhile exercise I should have done earlier. I found a different coworker, also on a mac, where this test fails for them. – Daniel Kaplan Feb 12 '13 at 22:07
  • Check out my answer, this is the real cause of the problem. – Daniel Kaplan Feb 13 '13 at 00:24
0

Turns out this is caused by a way that PostgreSQL works in version 9 vs version 8. Most of my coworkers were using version 8 but I recently got a new computer so I used the latest PostgreSQL.

You need to set the output_bytea to "escape".

Getting nearly double the length when reading byte[] from postgres with jpa

I didn't find the answer there sufficient though, but I found this on a mailing list and it fixed my issue: http://www.postgresql.org/message-id/AANLkTikkE-jQ9srZ9VL1JuJ5h=UCutx8ZLim+OfQ1T4z@mail.gmail.com

Dear List ,

recent change of bytea_output format from escape to hex in 9.0 apparently breaks popular persistent session handling perl modules like Apache::Session::Postgres which stores pickled data structures in bytea columns of db table. It is difficult to guess the root cause from the exception thrown by the said module. The problem is fixed by adding bytea_output='escape' in postgresql.conf and issuing a pg_ctl reload.

Eg in RT application the error is: error: RT couldn't store your session. This may mean that that the directory /blah/blah/foo/bar isn't writable or a database table is missing or corrupt

Regds Rajesh Kumar Mallah.

Community
  • 1
  • 1
Daniel Kaplan
  • 62,768
  • 50
  • 234
  • 356