I'm trying to use the PySpark CSV reader with the following criteria:
- Read the CSV according to the datatypes in the schema
- Check that the column names in the header and the schema match
- Store broken records in a new field
Here is what I have tried.
file: ab.csv
------
a,b
1,2
3,four
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
DDL = "a INTEGER, b INTEGER"
df = spark.read.csv('ab.csv', header=True, schema=DDL, enforceSchema=False,
columnNameOfCorruptRecord='broken')
df.show()
Output:
+----+----+
| a| b|
+----+----+
| 1| 2|
|null|null|
+----+----+
This command does not store the corrupt records. If I add broken to the schema and disable header validation with enforceSchema=True, the command works, but with a warning.
DDL = "a INTEGER, b INTEGER, broken STRING"
df = spark.read.csv('ab.csv', header=True, schema=DDL, enforceSchema=True,
columnNameOfCorruptRecord='broken')
df.show()
Output:
WARN CSVDataSource:66 - Number of column in CSV header is not equal to number of fields in the schema:
Header length: 2, schema size: 3
CSV file: file:/// ... /ab.csv
+----+----+------+
| a| b|broken|
+----+----+------+
| 1| 2| null|
|null|null|3,four|
+----+----+------+
Is this the intended behavior, or is there a bug that breaks the first example? Is there a better way to do this?
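For the header check in particular, the only workaround I can think of is to validate the header myself before doing the schema-enforced read. This is just a sketch of the idea (it assumes the expected column names are known up front and that the first line returned by spark.read.text is the header), not an option I found in the docs:

# Manual header validation before reading with enforceSchema=True (sketch, not a built-in option)
expected_columns = ['a', 'b']
header_line = spark.read.text('ab.csv').first()[0]          # first line of the file
actual_columns = [c.strip() for c in header_line.split(',')]
if actual_columns != expected_columns:
    raise ValueError(f"Header {actual_columns} does not match schema columns {expected_columns}")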
One more thing: I also want to keep the well-formed fields of the corrupt records, to get a dataframe like this.
+--+----+------+
| a| b|broken|
+--+----+------+
| 1| 2| null|
| 3|null|3,four|
+--+----+------+
Should I add an extra post-read step to get that, or is there some option I have missed that is more permissive?
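If an extra step is the way to go, this is roughly what I have in mind (a sketch only: it re-parses the raw line kept in the broken column with a plain comma split, which I assume is good enough for this file, and keeps whichever fields cast cleanly):

from pyspark.sql import functions as F

# Split the raw record stored in `broken` and fill nulls with any value that casts cleanly
parts = F.split(F.col('broken'), ',')
fixed = (df
         .withColumn('a', F.coalesce(F.col('a'), parts.getItem(0).cast('int')))
         .withColumn('b', F.coalesce(F.col('b'), parts.getItem(1).cast('int'))))
fixed.show()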