
I am using the Java SAX parser to read data from Excel (using the XSSF XLSX2CSV class) and load it into a Greenplum database. I am using the code from the following link:

http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/XLSX2CSV.java

I capture the PrintStream output from the above code, convert it into a ByteArrayInputStream, and then load it into Postgres (Greenplum) using the native bulk-load utility, the COPY command.

I modified the following in the main method of XLSX2CSV to capture the PrintStream output and convert it into a ByteArrayInputStream.

ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream ps = new PrintStream(baos, true, "UTF-8");

// The package open is instantaneous, as it should be.
try (OPCPackage p = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ)) {
    XLSX2CSV xlsx2csv = new XLSX2CSV(p, ps, minColumns);
    xlsx2csv.process();

    String data = new String(baos.toByteArray(), StandardCharsets.UTF_8);
    System.out.println(data);
    byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
    ByteArrayInputStream orinput = new ByteArrayInputStream(bytes);

    String dbURL1 = "jdbc:postgresql://xxxxx:xxxxx/xxxxx";
    String user = "xxxxxx";
    String pass = "xxxxxx";
    Connection GPConnection = DriverManager.getConnection(dbURL1, user, pass);

    Statement GPsqlStatement = GPConnection.createStatement();
    String GPStgTableTrunc = "truncate test_table";
    GPsqlStatement.execute(GPStgTableTrunc);
    System.out.print("Load to Greenplum starts " +
        Calendar.getInstance().getTime() + "\r\n");

    CopyManager copyManager = new CopyManager((BaseConnection) GPConnection);
    copyManager.copyIn("copy test_table from stdin csv", orinput);
    System.out.print("Load to Greenplum ends " +
        Calendar.getInstance().getTime() + "\r\n");
}

However, during the conversion to ByteArrayInputStream the newlines seem to be lost, and I get the following error while loading into Greenplum:

ERROR: COPY metadata not found. This probably means that there is a mixture of newline types in the data. Use the NEWLINE keyword in order to resolve this reliably. (seg40 sdw6.gphd.local:1025 pid=101588)

When I print the string 'data', it does seem to have newlines and the values print correctly; however, the bulk load into the database still fails.

How can I preserve the newlines in the above scenario so that the load happens correctly? Or, if there is a way to feed the PrintStream output to COPY as standard input, that works too. Thanks!
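Since the Greenplum error complains about a *mixture* of newline types, one way to rule that out is to normalize every terminator to a single style before handing the bytes to CopyManager. This is a minimal sketch, not the asker's actual pipeline; the class and method names are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class NewlineNormalizer {

    // Replace Windows (\r\n) and old-Mac (\r) terminators with \n,
    // so COPY sees exactly one newline style in the whole stream.
    static String normalizeNewlines(String csv) {
        return csv.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "a,b\r\nc,d\ne,f\r";
        String clean = normalizeNewlines(mixed);
        // Wrap the cleaned text for CopyManager.copyIn(...)
        InputStream in = new ByteArrayInputStream(
                clean.getBytes(StandardCharsets.UTF_8));
        System.out.println(clean.contains("\r")); // prints false
    }
}
```

Alternatively, since the error message itself suggests it, Greenplum's COPY accepts a NEWLINE option (e.g. `copy test_table from stdin csv newline 'LF'`); check the COPY reference for your Greenplum version before relying on it.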


1 Answer

Try "\r\n" instead of "\n":

ByteArrayOutputStream output = new ByteArrayOutputStream();
output.write("something\r\n".getBytes());
output.write("something\r\n".getBytes());

ByteArrayInputStream input = new ByteArrayInputStream(output.toByteArray());
s3.putStream(input);

Looks similar to this:

ByteArrayOutputStream/InputStream losing newline characters on S3 Import

Adding sample code below, tried with PrintStream:

static void printStream() throws Exception {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PrintStream ps = new PrintStream(baos, true, "UTF-8");
    ps.println("test 1");
    ps.println("test 2");
    ps.println("test 3");
    System.out.print(new String(baos.toByteArray(), StandardCharsets.UTF_8));
}

It prints:

test 1
test 2
test 3
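Whether that output actually ends in "\n" or "\r\n" is invisible on the console, because `println` appends `System.lineSeparator()`, which differs per OS. A quick way to check is to escape the raw bytes; the `SeparatorCheck` class name here is illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class SeparatorCheck {

    // Render \r and \n visibly so you can see which terminator
    // println() actually wrote into the captured stream.
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (b == '\r') sb.append("\\r");
            else if (b == '\n') sb.append("\\n");
            else sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(baos, true, "UTF-8");
        ps.println("test 1");
        ps.println("test 2");
        // println() uses System.lineSeparator(): "\n" on
        // Linux/macOS, "\r\n" on Windows.
        System.out.println(escape(baos.toByteArray()));
    }
}
```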
  • Hi, thanks for checking it out. I am loading the PrintStream output directly into a ByteArrayInputStream, so anything that needs to be added should be added directly to the PrintStream. I tried changing output.append('\n') to output.append('\r\n'), but append seems to take only one character. So I added output.append('\r') as a separate call, but that unfortunately introduces a blank line after every data row. – CuriP Nov 06 '18 at 13:48
  • Okay. Added code to the edited answer tried `PrintStream` and its corresponding output. it will help if you share the code which is causing the issue – greengreyblue Nov 06 '18 at 14:24
  • The code is in the link; I am using it as is. You can provide any Excel file as input and it should convert and print it as CSV. The further changes I am making are provided with this question. – CuriP Nov 06 '18 at 18:53