
I'm looking to validate each row of a CSV file of more than 600 million rows and up to 30 columns (the solution must process several large CSV files of that size).

Columns can be text, dates or amounts. The CSV must be validated with 40 rules: some rules check the correctness of the amounts, some check the dates (the format), etc.

The results of each validation rule must be saved and will be displayed afterwards.

Once the data is validated, a second stage of validation rules will be applied, based this time on sums, averages, etc. The results of each of these rules must also be saved.

I'm using Spark to load the file, with

session.read().format("com.databricks.spark.csv")
    .option("delimiter", "|")
    .option("header", "false")
    .csv(csvPath)

Or

session.read().option("header", "true").text(csvPath);

To iterate over each line, I see that there are two options:

  • Use dataset.map(row -> { something }); the “something” should validate each row and save the result somewhere (a rough sketch of what I mean follows after this list)

But as the “something” block will be executed on the executors, I don’t see how to get the results back to the driver, or where to store them so that they can be retrieved by the driver process.

  • The second option is to use dataset.collect(): but it will cause an OutOfMemoryError, as all the data will be loaded into the driver. We could use the take method and then remove the returned subset from the dataset (with a filter) and repeat the operation, but I'm not comfortable with this approach.
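Roughly, what I have in mind for the first option is something like this (a sketch in Scala; the ValidationResult class, the field positions and the two sample checks are purely illustrative, not my real rules):

case class ValidationResult(lineId: String, rule: String, passed: Boolean)

import session.implicits._

// `dataset` is the DataFrame loaded above; each row holds one raw line in column 0.
val results = dataset.flatMap { row =>
  val fields = row.getString(0).split("\\|", -1)
  Seq(
    ValidationResult(fields(0), "amount_format", fields(5).matches("""\d+(\.\d{1,2})?""")),
    ValidationResult(fields(0), "date_format",   fields(7).matches("""\d{4}-\d{2}-\d{2}"""))
  )
}
// results is a Dataset[ValidationResult], but I don't see where to store it so that
// the driver can later read back the outcome of each rule.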

I was wondering if someone could suggest a robust method to deal with this kind of problem: basically, keep Spark for the second stage of validation rules, and use Spark or another framework to ingest the file and execute the first set of validation rules and produce their results.

Thanks in advance for your help

moun

2 Answers


You can use the SparkSession to read the CSV file, then partition the data by a column and process the data in batches. For example, if you are writing the data to an external DB which does not need much processing:

dataFrame
    .write
    .mode(saveMode)
    .option("batchsize", 100)
    .jdbc(url, "tablename", new java.util.Properties())

If your business logic demands that you process each and every row of a Dataset/DataFrame, you can use df.map(). If your logic can work on a whole partition at once, you can go with df.mapPartitions(). Tasks with high per-record overhead perform better with a mapPartitions than with a map transformation.

Consider the case of initializing a database connection. If we use map() or foreach(), the number of times we would need to initialize it will be equal to the number of elements in the RDD, whereas if we use mapPartitions(), it will be equal to the number of partitions.
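For example, writing each row's check results to a database with one connection per partition could look roughly like this (a minimal sketch: the validate function, the validation_results table and the JDBC settings jdbcUrl, user, password are assumptions, not something from the question):

import java.sql.DriverManager

// One JDBC connection per partition instead of one per row.
df.rdd.foreachPartition { rows =>
  val connection = DriverManager.getConnection(jdbcUrl, user, password) // initialized once per partition
  val statement = connection.prepareStatement(
    "INSERT INTO validation_results (line_id, rule, passed) VALUES (?, ?, ?)")
  try {
    rows.foreach { row =>
      // `validate(row)` is assumed to return the (lineId, rule, passed) results for one row
      validate(row).foreach { case (lineId, rule, passed) =>
        statement.setString(1, lineId)
        statement.setString(2, rule)
        statement.setBoolean(3, passed)
        statement.addBatch()
      }
    }
    statement.executeBatch()
  } finally {
    statement.close()
    connection.close()
  }
}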

wandermonk

You can simply append check-result columns to your original dataframe and use a bunch of rule UDFs to perform the actual validation, something like this:

    import org.apache.spark.sql.functions.{col, udf}

    object Rules {
      val rule1UDF = udf(
        (col1: String, col2: String) => {
          // your validation code goes here
          true // the result of validation
        }
      )
      // ... rule2UDF, rule3UDF, rule4UDF defined similarly
    }

    val nonAggregatedChecksDf = df
       .withColumn("rule1_result", Rules.rule1UDF(col("col1"), col("col2")))
       .withColumn("rule2_result", Rules.rule2UDF(col("col1"), col("col3")))
       .select("id", "rule1_result", "rule2_result", <all the columns relevant for the aggregation checks>)

    val aggregatedChecksDf = nonAggregatedChecksDf
       .agg(<...>)
       .withColumn("rule3_result", Rules.rule3UDF(col("sum1"), col("avg2")))
       .withColumn("rule4_result", Rules.rule4UDF(col("count1"), col("count3")))
       .select("id", "rule1_result", "rule2_result", "rule3_result", "rule4_result")

The second option is to use dataset.collect

I'd advise not doing that, but rather selecting a key field from your original dataframe plus all the check result columns and saving them in a columnar format such as Parquet.

aggregatedChecksDf
    .select("id", "rule1_result", "rule2_result", "rule3_result", "rule4_result")
    .write
    .mode(saveMode)
    .parquet(path)

This will be much faster, as the writes are done by all executors in parallel and the driver doesn't become a bottleneck. It will also most likely help to avoid OOM issues, as the memory usage is spread across all the executors.
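To display the results afterwards, the saved Parquet can simply be read back and summarized; a minimal sketch, assuming the rule result columns are booleans and session is your SparkSession:

import org.apache.spark.sql.functions.{col, count, when}

// Read the persisted check results and count how many rows failed each rule.
val checkResults = session.read.parquet(path)

checkResults
  .select(
    count(when(!col("rule1_result"), true)).as("rule1_failures"),
    count(when(!col("rule2_result"), true)).as("rule2_failures"))
  .show()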

Denis Makarenko