Reading the data below from a CSV file fails with a com.univocity.parsers.common.TextParsingException:
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
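As a sanity check that the rows themselves are well-formed, here is a minimal sketch that parses the same three lines with Python's standard csv module (not Spark). It assumes RFC 4180-style quoting, where a doubled quote inside a quoted field stands for a literal quote; the variable names are only illustrative.

```python
import csv
import io

# The three sample rows from the question, verbatim.
sample = (
    'B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
    'B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z\n'
    'B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
)

# csv.reader defaults: delimiter=',', quotechar='"', doublequote=True.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    # Print the field count and the parsed fields of each record.
    print(len(row), row)
```

With this parser every record comes out with seven fields, and the quoted delimiter in the second record is read as a one-character field containing a comma. That suggests the data is valid CSV and the problem lies in how the Spark reader is configured or how it handles this input.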
Here's the PySpark (3.1.2) code used to read the data:
from pyspark.sql.dataframe import DataFrame
df = (spark.read.format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .option("multiline", "true")
    .option("quote", '"')
    .option("escape", '\"')
    .option("delimiter", ",")
    .option("unescapedQuoteHandling", "RAISE_ERROR")
    .load('/mnt/source/analysis/error_in_csv.csv'))
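One detail worth spelling out about the options above: in Python the literal '\"' is just a single double-quote character, so the escape option is set to the same character as the quote option. A small sketch (the variable names are hypothetical, used only for illustration):

```python
# The values passed to .option("escape", ...) and .option("quote", ...) above.
escape_option = '\"'
quote_option = '"'

# The backslash is consumed by Python's string-literal syntax,
# so both options hold the same one-character string.
print(escape_option == quote_option, len(escape_option))  # prints: True 1
```

This matches the "Quote character" and "Quote escape character" entries in the parser configuration printed with the exception, which both show a double quote.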
This is the exception I get:
Caused by: com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=true
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: []
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=RAISE_ERROR
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character="
Quote escape escape character=null
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:623)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:389)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:335)
... 33 more
Caused by: com.univocity.parsers.common.TextParsingException: Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
at com.univocity.parsers.csv.CsvParser.handleValueSkipping(CsvParser.java:241)
at com.univocity.parsers.csv.CsvParser.handleUnescapedQuote(CsvParser.java:319)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:393)
at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:177)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)
... 39 more
Can someone please advise? The quoted delimiter (",") in the second line appears to trigger the error. Is there a way to read this file without changing the source data?