Reading the following data from a CSV file raises a com.univocity.parsers.common.TextParsingException:

B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z

Here's the PySpark (3.1.2) code used to read the data:

# `spark` is the active SparkSession (e.g. the one provided by the notebook).
df = (
    spark.read.format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .option("multiline", "true")
    .option("quote", '"')
    .option("escape", '\"')  # '\"' is the same Python string as '"'
    .option("delimiter", ",")
    .option("unescapedQuoteHandling", "RAISE_ERROR")
    .load('/mnt/source/analysis/error_in_csv.csv')
)

This is the exception I'm getting:

Caused by: com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
Parser Configuration: CsvParserSettings:
    Auto configuration enabled=true
    Auto-closing enabled=true
    Autodetect column delimiter=false
    Autodetect quotes=false
    Column reordering enabled=true
    Delimiters for detection=null
    Empty value=
    Escape unquoted values=false
    Header extraction enabled=null
    Headers=null
    Ignore leading whitespaces=false
    Ignore leading whitespaces in quotes=false
    Ignore trailing whitespaces=false
    Ignore trailing whitespaces in quotes=false
    Input buffer size=1048576
    Input reading on separate thread=false
    Keep escape sequences=false
    Keep quotes=false
    Length of content displayed on error=1000
    Line separator detection enabled=true
    Maximum number of characters per column=-1
    Maximum number of columns=20480
    Normalize escaped line separators=true
    Null value=
    Number of records to read=all
    Processor=none
    Restricting data in exceptions=false
    RowProcessor error handler=null
    Selected fields=field selection: []
    Skip bits as whitespace=true
    Skip empty lines=true
    Unescaped quote handling=RAISE_ERRORFormat configuration:
    CsvFormat:
        Comment character=#
        Field delimiter=,
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character="
        Quote escape escape character=null
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:623)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:389)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:335)
    ... 33 more
Caused by: com.univocity.parsers.common.TextParsingException: Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
    at com.univocity.parsers.csv.CsvParser.handleValueSkipping(CsvParser.java:241)
    at com.univocity.parsers.csv.CsvParser.handleUnescapedQuote(CsvParser.java:319)
    at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:393)
    at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:177)
    at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)
    ... 39 more

Can someone please advise? It looks like the quoted delimiter in the second row (the field that contains only a comma, written as ",") is causing this. Is there a way to read the file without changing the source data itself?
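
As a sanity check outside Spark, here is a minimal sketch that parses the same three rows with Python's standard csv module. It assumes the file uses doubled quotes ("") as the escape inside quoted fields, which is what the sample suggests; the helper name parse_sample is only for illustration. This does not reproduce or fix the Spark error, it only checks whether the rows are well-formed under that convention.

```python
import csv
import io

# The three sample rows from above, copied verbatim.
SAMPLE = (
    'B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
    'B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z\n'
    'B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
)


def parse_sample(text):
    """Parse CSV text where a doubled quote escapes a quote (hypothetical helper)."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        doublequote=True,  # assumption: "" inside a quoted field means a literal "
    )
    return list(reader)


rows = parse_sample(SAMPLE)
for row in rows:
    # Print the field count next to the parsed fields of each row.
    print(len(row), row)
```

With this reader every row comes back with seven fields, and the fourth field of the second row is a single comma, so the data itself appears to be consistent; the failure seems specific to how the Spark/univocity parser is configured here.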

xuxu