I need to convert a CSV file to xml and write it to HDFS. The conversion itself works, but I get an error when writing the result. Here's the code.

    private static void writeToXml(String inputPath, String outputPath) throws IOException, JSchException {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://nn");
        FileSystem fileSystem = FileSystem.get(configuration);
        Path iPath = new Path(inputPath);
        Path oPath = new Path(outputPath);
        FSDataInputStream inputStream = fileSystem.open(iPath);

        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));

        HSSFWorkbook workbook = new HSSFWorkbook();
        HSSFSheet sheet = workbook.createSheet("Sheet");
        AtomicReference<Integer> row = new AtomicReference<>(0);

        try (Stream<String> stream = bufferedReader.lines()) {
            stream.forEach(line -> {
                Row currentRow = sheet.createRow(row.getAndSet(row.get() + 1));
                String[] nextLine = line.split(";");
                Stream.iterate(0, i -> i + 1).limit(nextLine.length).forEach(i -> {
                    currentRow.createCell(i).setCellValue(nextLine[i]);
                });
            });
            FSDataOutputStream outputStream = fileSystem.create(oPath);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            workbook.write(out);
            outputStream.write(out.toByteArray());
            outputStream.flush();
        }
    }

It fails with this error.

    org.apache.oozie.action.hadoop.JavaMainException: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    Caused by: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
        at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
        at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.syncWithDataSource(POIFSFileSystem.java:779)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.writeFilesystem(POIFSFileSystem.java:756)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.write(HSSFWorkbook.java:1387)
        at path.to.package.Main.writeToXml(Main.java:81)
        at path.to.package.Main.main(Main.java:24)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:57)
        ... 15 more

The exception is thrown at this line.

    workbook.write(out);

Edit: here is another snippet where I write the workbook straight to the HDFS output stream. It fails with the same error.

    FSDataOutputStream outputStream = fileSystem.create(oPath);
    workbook.write(outputStream);
    outputStream.flush();

What am I doing wrong?


1 Answer

In the end I couldn't find what was wrong with my dependencies. I rewrote the whole thing in Spark using the following dependencies.

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.crealytics</groupId>
            <artifactId>spark-excel_2.11</artifactId>
            <version>0.13.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.xmlbeans</groupId>
            <artifactId>xmlbeans</artifactId>
            <version>3.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.14.1</version>
        </dependency>
    </dependencies>

Note that xmlbeans is pinned to version 3.1.0. Later versions did not work for me, and I did not find out why.

Also note that I only needed spark-excel_2.11 to test it locally. The other dependencies were added because the job kept failing with NoClassDefFoundError when running on the cluster.

The code also ended up being a lot simpler.

    spark
      .read
      .option("delimiter", ";")
      .csv("/hdfs/path/to/file.csv")
      .repartition(1)
      .write
      .format("com.crealytics.spark.excel")
      .option("header", "false")
      .mode("overwrite")
      .save("/hdfs/path/to/file.xml")
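
For anyone who wants to dig into the original NoSuchMethodError instead of rewriting: in the stack trace, one commons-io class (AbstractByteArrayOutputStream) calls IOUtils.byteArray(int) and the JVM cannot find it, which usually points to two different commons-io versions on the classpath. As far as I know that method only exists in newer commons-io releases (I believe 2.9.0 and later), while the cluster may ship an older jar. The following is a minimal diagnostic sketch, not something I ran on the cluster in question. The class name `ClassOrigin` and its helper methods are made up for illustration. It prints where a class was loaded from and whether it has a given public method.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a short note if that is unavailable. */
    public static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // JDK classes loaded by the bootstrap loader have no code source
                return "(bootstrap or unknown)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    /** Returns true if the class is loadable and declares or inherits the given public method. */
    public static boolean hasMethod(String className, String methodName, Class<?>... parameterTypes) {
        try {
            Class.forName(className).getMethod(methodName, parameterTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class and method named in the stack trace of the question
        String name = "org.apache.commons.io.IOUtils";
        System.out.println(name + " loaded from: " + locate(name));
        System.out.println("byteArray(int) present: " + hasMethod(name, "byteArray", int.class));
    }
}
```

Running this inside the same Oozie Java action should show which commons-io jar wins on the cluster. If it is an older one than the version POI was built against, aligning or shading commons-io in the job jar is the usual direction to look in.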