1

I have been breaking my head over the past few hours at trying to figure out what's wrong with my code. This piece of code was all working fine until i received a file which had japanese characters in it. Notepad++ and even some online utility tools say that the encoding of the file is UTF-8. Notepad says its UTF-8-BOM. I have read my data from the file and I have processed it and finally want to write it out to the database.

I get the error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xee My database encoding is UTF8 only..

package citynet.dta.pump;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import citynet.common.ServerException;

public class TestEncoding {

public static void main(String[] args) {
    byte[] bytes = null;
    try {
        //use the below sql to create table 'testtable'
        // create table testtable (text1 character varying, text2 character varying,text3 character varying)
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            DataOutputStream out = new DataOutputStream(baos);

            out.writeBytes("INR,字仮名交じり文,3255104BTK1");

            bytes = baos.toByteArray();
        }
        Class.forName("org.postgresql.Driver");
        Connection c = DriverManager.getConnection("jdbc:postgresql://server:5432/dbname", "username", "password");
        if (bytes != null) {
            try (ByteArrayInputStream input = new ByteArrayInputStream(bytes)) {
                String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
                BaseConnection pgcon = (BaseConnection) c;
                CopyManager mgr = new CopyManager(pgcon);
                try {
                    mgr.copyIn(sql, input);

                } catch (SQLException ex) {
                    throw new ServerException("Error while copying data in Postgres DB:" + ex);

                }
            }
        }
    } catch (Exception e) {
        System.out.println("Error:" + e);
    }
  }
}

1 Answers1

2

The issue is DataOutputStream#writeBytes("INR,字仮名交じり文,3255104BTK1") is not doing what you expect.

  1. You should refrain from using BaseConnection as it is an internal class. Application code should use PGConnection

  2. Here's how you get CopyManager:

    Connection con = ...;
    PGConnection pgcon = con.unwrap(org.postgresql.PGConnection.class);
    CopyManager mgr = pgcon.getCopyAPI();
    
  3. The source of your data might be different so there are multiple ways to execute copyAPI.

    If you want to convert String to UTF-8 bytes via your own code, then you need getBytes.

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
    byte[] bytes = "INR,字仮名交じり文,3255104BTK1".getBytes(StandardCharsets.UTF_8);
    mgr.copyIn(sql, new ByteArrayInputStream(bytes));
    

    Note: there's no need to close ByteArrayInputStream (see its Javadoc).

  4. If you need to stream a CSV file to the database, you might use FileInputStream:

    try (InputStream fis = new FileInputStream("file.csv")) {
      mgr.copyIn(sql, fis);
    }
    
  5. If you want to build the contents incrementally, then you might use ByteArrayOutputStream + OutputStreamWriter

    Note: all the rows would need to fit in the memory otherwise you get OutOfMemoryError.

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (OutputStreamWriter wr = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
      // Write 10 rows
      for (int i = 0; i < 10; i++) {
        wr.write("INR,字仮名交じり文,3255104BTK1\n");
      }
    }
    
    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8'";
    mgr.copyIn(sql, new ByteArrayInputStream(baos.toByteArray()));
    
  6. An alternative option is to use Reader

    Note: encoding is not specified, and it is using connection-default encoding (which is utf-8 in 99.42% of the cases since the driver defaults to utf-8 connection encoding).

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null'";
    mgr.copyIn(sql, new StringReader("INR,字仮名交じり文,3255104BTK1"));
    
  7. Yet another alternative is to use copyIn(String sql, ByteStreamWriter from) API which might be more efficient for certain use-cases (e.g. all the data is in-memory, and you know the number of bytes you are going to write)

Vladimir Sitnikov
  • 1,467
  • 12
  • 31
  • Thanks a lot Valdimir. I see you have spent a lot of time writing all that up..Appreciate the help :) On few points that you have mentioned, yes we shouldn't be using BaseConnection and as a matter of fact it was only this piece of code where it was using this. All the rest of the application uses PGConnection..So not much code changes thankfully :) and regarding the point about writeBytes not doing what we expect, yes unfortunately I was just focusing on any overloaded method for writeBytes itself which might take encoding and i knew String object had getBytes method..Damn.. – skyrocker121 Apr 21 '20 at 10:26
  • Just a followup question. Do we still have to mention ENCODING UTF8 in the copy command as you mentioned that 99.42% of the cases the driver defaults to utf-8 connection encoding – skyrocker121 Apr 21 '20 at 11:59
  • Well, when you use *byte* API, and you should specify the encoding to make it explicit and avoid unexpected data corruption – Vladimir Sitnikov Apr 21 '20 at 12:40
  • To just be clear. on your point 5, you are saying we should still specify the encoding in String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null' ENCODING 'UTF8'"; – skyrocker121 Apr 21 '20 at 13:00
  • Yes, item5 should have explicit enoding in sql otherwise the driver (and the database) does not have enough information. Item6 is different because it uses `String`, so the driver knows the way to convert Java String to a sequence of bytes the database would understand. However, item5 is sending bytes, and the database should know the way to interpret that. – Vladimir Sitnikov Apr 21 '20 at 13:33