
The problem is pretty weird: if I work with the uncompressed file, there is no issue, but if I work with the compressed .bz2 file, I get an index out of bounds error.

From what I've read, it's apparently the spark-csv parser failing to detect the end-of-line character and reading the whole file as one huge line. The fact that it works on the uncompressed .csv but not on the .csv.bz2 file is strange to me.

Also, like I said, it only happens when doing a DataFrame union. I tried an RDD union via the SparkContext instead and got the same error.
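Roughly, the code looks like this (a minimal sketch; the paths, app name, and options are placeholders, not my actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bz2-union")
      .getOrCreate()

    // spark-csv reads the .bz2 files through the Hadoop bzip2 codec
    val df1 = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/part1.csv.bz2")

    val df2 = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/part2.csv.bz2")

    // The union is where the index out of bounds error shows up;
    // the same job on the uncompressed .csv files runs fine.
    df1.union(df2).count()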

flipper2gv
    You might be hitting this bug: https://issues.apache.org/jira/browse/HADOOP-10614 - what Hadoop version are you using? – Tzach Zohar Oct 16 '16 at 20:43
  • I'm on spark 2.0.0. It is this error, I'm getting the same stack trace. It says it's fixed but spark-csv is either using an old version of that library or it's not actually fixed. Any idea how I could be fixing that manually? – flipper2gv Oct 16 '16 at 20:46
  • Spark can run with various Hadoop versions - which one are you using? This bug seems to have been fixed in 2.5.0, if you're using an earlier version it's probably it. – Tzach Zohar Oct 16 '16 at 20:49
  • @TzachZohar, sorry, brain-farted. I'm on Hadoop 2.7 – flipper2gv Oct 16 '16 at 20:50
  • That's OK :) Umm... no, sorry, if it's 2.7 but looks like the same bug then I'm out of ideas... – Tzach Zohar Oct 16 '16 at 20:55
  • I tried again with the built-in csv parser instead of calling the format(databricks) way, same error. Not surprising since it's technically the same code. Thanks anyway. – flipper2gv Oct 16 '16 at 20:58
  • @TzachZohar, I'm using ScalaIDE and I didn't run mvn eclipse:eclipse after I specified hadoop 2.7.3. I feel dumb. – flipper2gv Oct 16 '16 at 23:08

1 Answer


My whole problem was that I was using Scala IDE. I thought I was using Hadoop 2.7, but I hadn't run mvn eclipse:eclipse to update my M2_REPO, so the referenced libraries still pointed at Hadoop 2.2 (the latest spark-core pulls in Hadoop 2.2 by default, I don't know why).

All in all, for future reference: if you plan on using spark-csv, don't forget to specify the Hadoop version in your pom.xml, even though spark-core already references a Hadoop version by itself.

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
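
If you want to double-check which Hadoop version actually ends up on the classpath after the change, one quick way (just a sketch) is to print it from Hadoop's VersionInfo class:

    import org.apache.hadoop.util.VersionInfo

    // Prints the Hadoop version resolved on the classpath;
    // after fixing the pom it should report 2.7.3 instead of 2.2.0.
    println(s"Hadoop on classpath: ${VersionInfo.getVersion}")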
flipper2gv
  • You don't need spark-csv in Spark 2.0. csv source is already built-in. –  Oct 17 '16 at 01:59
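
As the comment above points out, Spark 2.0+ ships the csv source natively, so the external package isn't needed. A minimal sketch of both styles (assuming spark is an existing SparkSession and the path is a placeholder):

    // Spark 1.x style, via the external spark-csv package:
    val viaPackage = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/file.csv.bz2")

    // Spark 2.0+ built-in csv source, no extra dependency needed:
    val builtIn = spark.read
      .option("header", "true")
      .csv("data/file.csv.bz2")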