Drop rows in Spark which don't follow schema

Question

Currently, the schema for my table is:

root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)

I want to apply the schema below to this table and delete all the rows that do not follow it:

import org.apache.spark.sql.types._

val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))

score 1 · Accepted Answer · answered May 14 '20 at 07:14

Set the read option "mode" to "DROPMALFORMED" while loading the data. Spark then drops the records it cannot parse against the supplied schema instead of keeping them.

spark.read.format("json")
  .option("mode", "DROPMALFORMED") // drop records that do not fit productsSchema
  .schema(productsSchema)
  .load("sample.json")
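To see what this mode does without touching a file, here is a minimal sketch. It assumes a local SparkSession and three made-up JSON lines held in memory in place of sample.json; the names `jsonLines` and `strict` are only for illustration. The second line has a string where the schema expects an integer and the third is not valid JSON, so only the first line should survive.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropmalformed-sketch")
  .getOrCreate()
import spark.implicits._

val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))

// Hypothetical sample data: one good record, one type mismatch, one broken line.
val jsonLines = Seq(
  """{"product_id":1,"product_name":"sampleA","aisle_id":10,"department_id":100}""",
  """{"product_id":2,"product_name":"sampleB","aisle_id":"AA","department_id":200}""",
  """{"name","srinivas","age":29}"""
).toDS()

// Same reader options as above, applied to a Dataset[String] of JSON lines.
val strict = spark.read
  .schema(productsSchema)
  .option("mode", "DROPMALFORMED")
  .json(jsonLines)

strict.show(false)
```

Note that "mode" is applied while reading. If the table is already loaded with string columns, as in the question, the data has to be re-read with the target schema for this option to have any effect.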

msrv499

score 0 · Answer 2 · answered May 14 '20 at 04:38

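If the data does not match the schema, Spark puts null in the affected columns, and a record that cannot be parsed at all comes back as a row with null in every column. We just have to filter out the rows in which all columns are null.

The example below uses filter to drop those all-null rows.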

scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0

scala> schema.printTreeString
root
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)
 |-- product_id: long (nullable = true)
 |-- product_name: string (nullable = true)


scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Loading Json data & if schema is not matching we will be getting null rows for all columns.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]

scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
|null    |null         |null      |null        |
+--------+-------------+----------+------------+


scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Keep rows that have at least one non-null column.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
+--------+-------------+----------+------------+
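The same all-null filter can also be written without building a SQL string. The sketch below assumes a local SparkSession and a small in-memory DataFrame that mimics the table above (the rows and the names `rows`, `viaFilter`, `viaNaDrop` and `strict` are made up for illustration). It compares a Column-based filter with `na.drop("all")`, and shows `na.drop("any")` as the stricter variant that also removes partially null rows such as the "dd" one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("all-null-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical rows: one complete, one with a missing field, one entirely null.
val rows: Seq[(Option[String], Option[String], Option[Long], Option[String])] = Seq(
  (Some("AA"), Some("AAD"), Some(1L), Some("sampleA")),
  (Some("dd"), None, Some(3L), Some("sampledd")),
  (None, None, None, None)
)
val df = rows.toDF("aisle_id", "department_id", "product_id", "product_name")

// Keep a row when at least one column is non-null (same logic as the " or " filter).
val anyNonNull = df.columns.map(c => col(c).isNotNull).reduce(_ || _)
val viaFilter = df.filter(anyNonNull)

// Built-in equivalent: drop a row only when all of its columns are null.
val viaNaDrop = df.na.drop("all")

// Stricter variant: drop a row when any column is null.
val strict = df.na.drop("any")

viaFilter.show(false)
viaNaDrop.show(false)
strict.show(false)
```

Which variant is appropriate depends on whether a partially filled row counts as following the schema.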

score 0 · Answer 3 · answered May 14 '20 at 07:06

Do check out the na.drop functions on DataFrame: you can drop rows that contain any null, rows with fewer than a minimum number of non-null values, or rows that have nulls in specific columns.

scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]

scala> res7.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+

// drops a row if any of its columns is null
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+

// keeps only rows with at least 3 non-null values (`minNonNulls = 3`), so the row containing a null is dropped
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+

// every row has at least 2 non-null values, so nothing is dropped
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+

//drops row based on nulls in `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+
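For the table in the question, which is already loaded with aisle_id and department_id as strings, one possible way to combine casting with null handling is sketched below. It assumes a local SparkSession, made-up sample rows, and that ANSI mode is off so that a failed cast yields null instead of an error. The names `intColumns`, `conforms` and `cleaned` are only for illustration. A row is kept when each target column is either null to begin with or casts cleanly to an integer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-and-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: with ANSI mode off, casting a non-numeric string to int gives null.
spark.conf.set("spark.sql.ansi.enabled", "false")

// Hypothetical rows shaped like the question's table (ids stored as strings).
val products = Seq(
  (1, "sampleA", "10", "100"),
  (2, "sampleB", "AA", "200"), // "AA" cannot become an integer
  (3, "sampleC", "30", null)   // null is allowed because the field is nullable
).toDF("product_id", "product_name", "aisle_id", "department_id")

val intColumns = Seq("aisle_id", "department_id")

// A value conforms if it was null already or if the cast does not turn it into null.
val conforms = intColumns
  .map(c => col(c).isNull || col(c).cast("int").isNotNull)
  .reduce(_ && _)

// Drop the non-conforming rows first, then apply the casts to the remaining rows.
val cleaned = intColumns.foldLeft(products.filter(conforms)) { (df, c) =>
  df.withColumn(c, col(c).cast("int"))
}

cleaned.printSchema()
cleaned.show(false)
```

Casting first and then calling na.drop(intColumns) would be shorter, but it would also remove rows whose value was null before the cast, which the nullable schema permits.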
