
I am developing an application that performs data quality checks over input files and captures counts based on reported DQ failures in the data. Does the approach I use make sense, or would you recommend a better way to do this?

I am trying to write an application in Python that captures DQ errors in the data and gathers counts. I could have used Pandas or NumPy for this; however, since the data volume is huge (~100 GB), I decided to do it with Spark. This is my third application in Python, so while I can write code in it, I am out of my depth on whether this is really the optimal way to do it.

So, to summarize: I read multiple CSV files, write them into a single Parquet file, create a temp view over it that I can query to find DQ issues, capture the results of each query, and append them to a list. That list is later used to write a CSV that becomes the input for a dashboard report. Code below.

# Importing required libraries
import datetime
from pyspark.sql import SparkSession

# Initiating Spark Session
spark = SparkSession.builder.appName("DQ-checks").getOrCreate()

# Initializing Variables
time1 = datetime.datetime.now()
src_file_01 = r'\All kinds of data files\craigslistVehicles.csv'
target_filename = r'\All kinds of data files\craigslistVehicles.parquet'

# Read the CSV file through Spark
df = spark.read.csv(src_file_01, header=True, inferSchema=True)

# Update code to make it flexible enough to read multiple files
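# (Sketch, not part of the original code.) spark.read.csv accepts a list of paths
# or a glob pattern, so multiple files could be read in a single pass, e.g.:
#     df = spark.read.csv([src_file_01, src_file_02], header=True, inferSchema=True)
#     df = spark.read.csv(r'\All kinds of data files\*.csv', header=True, inferSchema=True)
# The second file name and the glob pattern above are placeholders.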

# Write the contents of the CSV file into a Parquet file
df.write.mode("overwrite").parquet(target_filename)
print("created a parquet file")

# Create a temporary view over the Parquet file to query data
df2 = spark.read.parquet(target_filename)
df2.createOrReplaceTempView("craigslistVehicles")

# Create a column list from the header of the Spark View
column_list = df2.columns
print(column_list)

# Capturing time before start of the query for benchmarking
time2 = datetime.datetime.now()
result_store = []
# Iterate through all the columns and capture null counts for each column
rule_type = 'Null Check'
for column_name in column_list:
    # Backticks protect column names that contain spaces or special characters
    query = "select count(*) as null_count from craigslistVehicles where `{}` is null".format(column_name)
#    print(query)
    null_count = spark.sql(query).first()['null_count']
    # Append one row per column so the list can later be written out as a CSV
    result_store.append([rule_type, column_name, null_count])
    print([rule_type, column_name, null_count])

# Next Steps - Update code to add more data quality checks based on requirement.
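# (Sketch, not part of the original code.) Additional rules could follow the same
# loop pattern; for example, an "Empty Check" counting blank strings per column:
#     for column_name in column_list:
#         query = ("select count(*) as empty_count from craigslistVehicles "
#                  "where trim(`{}`) = ''".format(column_name))
#         empty_count = spark.sql(query).first()['empty_count']
#         result_store.append(['Empty Check', column_name, empty_count])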


# Next Steps - Insert results of the queries into a spark table that can be used as a log and becomes an input for a dashboard report.
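# (Sketch, not part of the original code.) One way to persist the results is to
# turn result_store into a small DataFrame and write it out; the output path is
# a placeholder:
#     results_df = spark.createDataFrame(result_store,
#                                        schema=["rule_type", "column_name", "failure_count"])
#     results_df.coalesce(1).write.mode("overwrite").csv(r'\All kinds of data files\dq_results',
#                                                        header=True)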


# Capturing time after end of the query for benchmarking
time3 = datetime.datetime.now()
print("Query time is.{}",time3-time2)
print("Total Job run time is.{}",time3-time1)

# Spark Session Stop
spark.stop()

Currently, this works. I am able to process a 1.1 GB file in under a minute.

My questions are:

Does this design make sense? How would you do it if you had to? Is there something obvious I can change to make the code cleaner?

Hardy
  • Why exactly do you need to read all the CSVs, create one Parquet file from them, and then read the Parquet file back to create a temp view? I think you are doing unnecessary extra work – Kafels Jul 09 '19 at 20:01
  • And counting null values for all columns can be done with one select; see this example: [How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?](https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe) (a sketch of this appears after these comments) – Kafels Jul 09 '19 at 20:03
  • Just in addition to Kafels: in general you should try to avoid calling `collect()`. It pulls all the data back to the Spark driver and worsens your performance. – cronoik Jul 09 '19 at 23:31
  • This might be worth checking out as an alternative for doing data quality checks with spark: https://github.com/timgent/data-flare – Timmeh Aug 16 '20 at 11:42
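Following up on Kafels' and cronoik's comments, here is a rough sketch (an assumption about how it could be wired into the code above, not verified against the data) of counting nulls for every column in a single select, which also avoids one `collect()` call per column. It works directly on the DataFrame `df2` from the question; the Parquet round-trip could even be skipped by calling `createOrReplaceTempView` on `df` directly.

from pyspark.sql import functions as F

# One aggregation pass that counts nulls for every column of df2
null_counts = df2.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df2.columns]
).first()
result_store = [['Null Check', c, null_counts[c]] for c in df2.columns]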

1 Answer


You should consider the following strategies for data quality in your environment:

  • embed quality checks into your ETL process,
  • integrate with alerting systems,
  • implement both record-level and batch-level checks,
  • build tooling that generates profiling-check code.

Then you're set up for success.
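For illustration only (not part of the answer): a minimal sketch of what a record-level check versus a batch-level check could look like against the craigslistVehicles view from the question. The column name `price` is an assumption.

# Record-level: flag each row that fails a rule (assumed column `price`)
record_level = spark.sql("""
    select *, (price is null or price < 0) as price_check_failed
    from craigslistVehicles
""")

# Batch-level: a single aggregate over the whole batch
batch_stats = spark.sql("""
    select count(*) as total_rows,
           sum(case when price is null then 1 else 0 end) as null_price_rows
    from craigslistVehicles
""").first()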

nefo_x