0

I'm using pandas, and exporting data as json like this:

import pandas as pd
df = pd.DataFrame({'a': ['Têst']})

df.to_json(orient='records', lines=True)
> u'{"a":"T\\u00east"}'

This makes sense since we have a Unicode character 00ea prefixed with \u and it is escaped with \ when converted to JSON

But then I import the JSON strings into Postgres with COPY

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
  cursor.copy_expert(sql="""
  COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
  """, file=buffer)

The problem is that the result in the database ends up being

{"a": "Tu00east"}

and as you can see the double \\ is gone.

I tried using CSV as the COPY mode, but it messes things up since there are commas in some of the data, and trying to set ESCAPE character and DELIMITER to something else always seem to cause failures.

The table column has a jsonb type. I read in the docs that PG doesn't like non-ASCII Unicode over \x003f unless the DB encoding is UTF8, which it is in my case, so that shouldn't be an issue.

I'm trying to figure out why the escaping characters are removed here, and how to import into Postgres and conserve the encoding.

MrE
  • 19,584
  • 12
  • 87
  • 105

1 Answers1

2

Use the csv option for COPY, with DELIMITER e'\x01' QUOTE e'\x02'. I'm not sure whether this works for all possible valid JSON, but I've never had it fail.

$ psql -X testdb -c 'create table t(d jsonb)'
CREATE TABLE
$ cat foo.json
{"a":"Têst"}
$ cat foo.json | psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'" 
COPY 1
$ psql -X testdb -c 'select * from t';                                                    
       d       
---------------
 {"a": "Têst"}
(1 row)
AdamKG
  • 13,678
  • 3
  • 38
  • 46
  • ok, interesting. Can you explain why \x01 and \x02? I guess just anything that's not going to be found in the data, right? – MrE Nov 02 '18 at 04:55
  • Seems like the trick with CSV is to AVOID any common character. I was trying to set the quote, escape etc... to be what I have, but I see the trick here since instead PG needs to get the strings as is, since they're already escaped and encoded right. – MrE Nov 02 '18 at 05:07
  • what about encoding support at import ? – user1767316 Aug 15 '19 at 17:00