
The problem is pretty weird: if I work with the uncompressed file, there is no issue, but if I work with the compressed .bz2 file, I get an index out of bounds error.

From what I've read, it's apparently the spark-csv parser failing to detect the end-of-line character and reading the whole file as one huge line. The fact that it works on the uncompressed .csv but not on the .csv.bz2 file is strange to me.

Also, like I said, it only happens when doing a DataFrame union. I tried an RDD union via the SparkContext instead and got the same error.
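Roughly, the code looks like this (a minimal sketch; the paths, app name, and options are placeholders, not my actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bz2-union")
      .getOrCreate()

    // spark-csv reads the .bz2 files through the Hadoop bzip2 codec
    val df1 = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/part1.csv.bz2")

    val df2 = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/part2.csv.bz2")

    // The union is where the index out of bounds error shows up;
    // the same job on the uncompressed .csv files runs fine.
    df1.union(df2).count()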

flipper2gv
    You might be hitting this bug: https://issues.apache.org/jira/browse/HADOOP-10614 - what Hadoop version are you using? – Tzach Zohar Oct 16 '16 at 20:43
  • I'm on spark 2.0.0. It is this error, I'm getting the same stack trace. It says it's fixed but spark-csv is either using an old version of that library or it's not actually fixed. Any idea how I could be fixing that manually? – flipper2gv Oct 16 '16 at 20:46
  • Spark can run with various Hadoop versions - which one are you using? This bug seems to have been fixed in 2.5.0, if you're using an earlier version it's probably it. – Tzach Zohar Oct 16 '16 at 20:49
  • @TzachZohar, sorry, brain-farted. I'm on Hadoop 2.7 – flipper2gv Oct 16 '16 at 20:50
  • That's OK :) Umm... no, sorry, if it's 2.7 but looks like the same bug then I'm out of ideas... – Tzach Zohar Oct 16 '16 at 20:55
  • I tried again with the built-in csv parser instead of calling the format(databricks) way, same error. Not surprising since it's technically the same code. Thanks anyway. – flipper2gv Oct 16 '16 at 20:58
  • @TzachZohar, I'm using ScalaIDE and I didn't run mvn eclipse:eclipse after I specified hadoop 2.7.3. I feel dumb. – flipper2gv Oct 16 '16 at 23:08

1 Answer


My whole problem was that I was using Scala IDE. I thought I was using Hadoop 2.7, but I hadn't run mvn eclipse:eclipse to update my M2_REPO, so the referenced libraries still pointed at Hadoop 2.2 (the latest spark-core pulls in Hadoop 2.2 by default, I don't know why).

All in all, for future reference: if you plan on using spark-csv, don't forget to specify the Hadoop version in your pom.xml, even though spark-core already references a Hadoop version by itself.

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
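
If you want to double-check which Hadoop version actually ends up on the classpath after the change, one quick way (just a sketch) is to print it from Hadoop's VersionInfo class:

    import org.apache.hadoop.util.VersionInfo

    // Prints the Hadoop version resolved on the classpath;
    // after fixing the pom it should report 2.7.3 instead of 2.2.0.
    println(s"Hadoop on classpath: ${VersionInfo.getVersion}")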
flipper2gv
  • You don't need spark-csv in Spark 2.0. csv source is already built-in. –  Oct 17 '16 at 01:59
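
As the comment above points out, Spark 2.0+ ships the csv source natively, so the external package isn't needed. A minimal sketch of both styles (assuming spark is an existing SparkSession and the path is a placeholder):

    // Spark 1.x style, via the external spark-csv package:
    val viaPackage = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/file.csv.bz2")

    // Spark 2.0+ built-in csv source, no extra dependency needed:
    val builtIn = spark.read
      .option("header", "true")
      .csv("data/file.csv.bz2")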