I'm trying to join a TSV dataset that has a lot of embedded newlines in its data to another dataframe, and I keep getting
com.univocity.parsers.common.TextParsingException
I've already cleaned the data to replace \N with NA, since I thought that could be the cause, but without success.
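Roughly, the cleanup step looks like this (a minimal sketch; the path is a placeholder, and replacing across all string columns with `na.replace` is an approximation of what I did):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the raw TSV and rewrite every literal \N as NA before the join.
// "/path/to/raw.tsv" is a placeholder.
val cleaned = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("/path/to/raw.tsv")
  .na.replace("*", Map("\\N" -> "NA"))
```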
The error points me to the following record in the faulty data:
tt0100054 2 Повелитель мух SUHH ru NA NA 0
The stack trace is as follows:
19/03/02 17:45:42 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000).
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
Sesso e il poliziotto sposato IT NA NA NA 0[\n]
tt0097089 4 Sex and the Married Detective US NA NA NA 0[\n]tt0100054 1 Fluenes herre NO NA imdbDisplay NA 0
tt0100054 20 Kärpästen herra FI NA NA NA 0
tt0100054 2
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000000
at com.univocity.parsers.common.input.AbstractCharInputReader.appendUtilAnyEscape(AbstractCharInputReader.java:331)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:246)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:119)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:400)
... 22 more
I've already tried setting `.option("maxCharsPerCol", "110000000")` and `.option("multiLine", "true")` on the CSV reader, but neither helps; the full read call is sketched below. I'd appreciate any help fixing this.
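This is roughly how the read looks with those options applied (a sketch: the path and variable name are placeholders, and note that `maxCharsPerColumn` is the spelling documented for Spark's CSV reader, which may differ from what I originally set):

```scala
// Sketch of the failing read. "/path/to/data.tsv" is a placeholder.
// Spark's CSV documentation spells the length option "maxCharsPerColumn".
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("maxCharsPerColumn", "110000000")
  .option("multiLine", "true")
  .csv("/path/to/data.tsv")
```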
I'm using Spark 2.0.2 and Scala 2.11.8.