
I'm trying to join a TSV dataset, which has a lot of newlines in the data, to another dataframe and I keep getting

com.univocity.parsers.common.TextParsingException

I've already cleaned my data to replace `\N` with NAs, as I thought that could be the reason, but with no success.

The error points me to the following record in the faulty data:

tt0100054 2 Повелитель мух SUHH ru NA NA 0

The stack trace is as follows:

    19/03/02 17:45:42 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
    Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
    tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
tt0100054   2
    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000000
    at com.univocity.parsers.common.input.AbstractCharInputReader.appendUtilAnyEscape(AbstractCharInputReader.java:331)
    at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:246)
    at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:119)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:400)
    ... 22 more

I've already tried setting the following options on the CSV reader: `option("maxCharsPerCol","110000000").option("multiLine","true")`; it doesn't help. I'd appreciate any help fixing this.
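
Roughly, the read looks like this (simplified; the SparkSession setup and the path below are placeholders for the IMDb title.akas TSV dump, the options are the ones mentioned above):

    import org.apache.spark.sql.SparkSession

    // Simplified sketch of the failing read; the path is a placeholder.
    val spark = SparkSession.builder().appName("imdb-title-akas").getOrCreate()
    val sqlContext = spark.sqlContext

    val csvOptionsMap = Map(
      "sep" -> "\t",
      "header" -> "true",
      "maxCharsPerCol" -> "110000000", // tried, doesn't help
      "multiLine" -> "true"            // tried, doesn't help
    )
    val title_akas = sqlContext.read.options(csvOptionsMap).csv("data/title.akas.tsv")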

I'm using Spark 2.0.2 and Scala 2.11.8.

noobnoob
  • I guess the actual error is the line: `java.lang.ArrayIndexOutOfBoundsException: 1000000`. Somewhere your code is trying to access an invalid array index. Could you share only the part of the source code where you think the cause of the error is? – Soheil Pourbafrani Mar 02 '19 at 13:05
  • I doubt it. It fails on the record. Here's the line in the code: `title_akas1.where($"titleId"==="tt0100054").show(false)` – noobnoob Mar 02 '19 at 13:45

3 Answers


Author of univocity-parsers here.

The parser was built to fail fast when something is potentially wrong with either your program (e.g. the file format was not configured correctly) or the input file (e.g. it doesn't have the format your program expects, or has unescaped/unclosed quotes).

The stack trace shows this:

Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
tt0100054   2

Which clearly shows the content of multiple rows being read as if they were part of a single value. This means that somewhere around this text in your input file there are values starting with a quote that is never closed.

You can configure the parser to not try to handle quoted values with this:

settings.getFormat().setQuote('\0');

If you are sure your format configuration is correct and that there are very long values in the input, set `maxCharsPerColumn` to `-1`.

Lastly, it looks like you are parsing TSV, which is not CSV and should be processed differently. If that's the case, you can also try to use the `TsvParser` instead.
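
Used directly (outside of Spark), a minimal sketch would look roughly like this in Scala; the file name, encoding and header handling here are just example assumptions:

    import java.io.{FileInputStream, InputStreamReader}
    import com.univocity.parsers.tsv.{TsvParser, TsvParserSettings}

    // Standalone TsvParser sketch; file name and encoding are examples.
    val settings = new TsvParserSettings()
    settings.setHeaderExtractionEnabled(true) // treat the first row as a header
    settings.setMaxCharsPerColumn(-1)         // no limit on the length of a single value

    val parser = new TsvParser(settings)
    val reader = new InputStreamReader(new FileInputStream("title.akas.tsv"), "UTF-8")
    val rows = parser.parseAll(reader)        // java.util.List[Array[String]]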

Hope this helps.

Jeronimo Backes
  • Thanks for the reply, Jeronimo. I'm trying to parse a dataset from IMDB which has a lot of foreign characters. Since I'm just reading the file as `val title_akas = sqlContext.read.options(csvOptionsMap).csv(title_akas_filepath)` I'm not sure how to configure the parser settings that you mentioned, explicitly. – noobnoob Mar 04 '19 at 05:08
  • If it has foreign characters I'm pretty sure you need to somehow provide the character encoding, something like `.csv(title_akas_filepath, "UTF-8")`. I'm not familiar with the settings available from Spark to help a lot, but I believe there is an `inferschema` option as well which I hope auto-detects the format of what you are parsing. If nothing helps, try using the parser directly. – Jeronimo Backes Mar 04 '19 at 05:13
  • I have the entire code on my GitHub at [link](https://github.com/SwapnilKamdar/popular-movies/blob/master/src/main/scala/com/kindred/bigdata/assignment/PopularMovies.scala), and although I've already implemented a workaround by processing this file as text and then converting it to a DataFrame, I'd like to know your thoughts if you can take a look at the code from line 63 to 66 and from 80 to 87. – noobnoob Mar 04 '19 at 05:25
  • I had a look at the code but I could not make sense of it as I don't work with Scala. Not sure which version of the library comes with Spark 2.0.2. I saw your comment mentioning you think it's a bug in the parser. If you think that's the case you can try adding an explicit dependency on univocity-parsers 2.8.1 in your pom.xml to see if it helps, or use the latest Spark version. – Jeronimo Backes Mar 04 '19 at 05:36
  • The reason I feel it's a bug is because I could parse the erroneous record on a reduced data set without making any changes to the parser setting. It worked perfectly. – noobnoob Mar 04 '19 at 05:39
  • The issue seems to be happening due to an unclosed quote. It looks like somewhere in the middle of the file a value starts with `"`. The CSV parser will try to find the closing `"` unless you configure the parser to disregard quotes, with something like `sqlContext.read.options(csvOptionsMap).quote('\0')`. You need to disable quotes in the options as I said in my answer. – Jeronimo Backes Mar 04 '19 at 05:45
  • Tried adding the following options, it still fails `val csvOptionsMap = Map("sep" -> "\t", "header" -> "true","inferSchema"->"true","encoding"->"UTF-8","maxCharsPerCol"->"-1","setQuote"->"\0")` The line that you suggested `sqlContext.read.options(csvOptionsMap).quote('\0')` may work but I cannot call the quote function like you suggested. It says _value quote is not a member of org.apache.spark.sql.DataFrameReader_ – noobnoob Mar 04 '19 at 05:51
  • Maybe `"quote"->"\0"`? – Jeronimo Backes Mar 04 '19 at 05:55
  • Was just about to comment the same. It worked with `"quote"->"\0"`. :) – noobnoob Mar 04 '19 at 05:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189371/discussion-between-swapnil-and-jeronimo-backes). – noobnoob Mar 04 '19 at 06:30

Jeronimo's answer will solve this issue.

Just adding a sample code block in case you are wondering how to do this in Spark:

val tsvData = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("quote", "\0")
  .csv(csvFilePath)
Akhil

For anyone encountering this issue reading wide CSV files within Spark, see https://spark.apache.org/docs/latest/sql-data-sources-csv.html

The CSV reader in Spark has a setting `maxColumns`, which defaults to 20480 (as of Spark 3.3).

You can increase this limit by setting it to a number at least as large as the expected number of columns (if known):

spark.read.format("csv").option("header", "true").option("maxColumns", 500000).load(filename)

Keep in mind that there's a tradeoff with increasing `maxColumns`: you're preallocating more memory, so at a certain point you'll run out of memory from preallocating too much extra space.

Andrew L