I'm looking to validate each row of a CSV file of more than 600 million rows and up to 30 columns (the solution must handle several large CSV files in that range).
Columns can be text, dates or amounts. The CSV must be validated with 40 rules: some rules check the correctness of the amounts, some check the dates (the format), etc.
The result of each validation rule must be saved and will be displayed afterwards.
Once the data is validated, a second stage of validation rules is applied, based this time on sums, averages, etc. The results of each of these rules must also be saved.
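To make the first stage concrete, here is a minimal sketch of the kind of per-row rule I have in mind (the RowRule interface, the DateFormatRule class and the yyyy-MM-dd pattern are hypothetical, just to show the shape of a rule):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical contract for a first-stage rule: null means the row passes.
interface RowRule {
    String apply(String[] fields);
}

// Example rule: the column at columnIndex must be a yyyy-MM-dd date.
class DateFormatRule implements RowRule {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private final int columnIndex;

    DateFormatRule(int columnIndex) {
        this.columnIndex = columnIndex;
    }

    @Override
    public String apply(String[] fields) {
        try {
            LocalDate.parse(fields[columnIndex], FORMAT);
            return null;                                        // date is valid
        } catch (DateTimeParseException | ArrayIndexOutOfBoundsException e) {
            return "invalid date in column " + columnIndex;     // rule violation
        }
    }
}
```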
I'm using Spark to load the file, with either

    session.read().format("com.databricks.spark.csv")
           .option("delimiter", "|")
           .option("header", "false")
           .csv(csvPath)

or

    session.read().option("header", "true").text(csvPath);
To iterate on each line, I see that there are two options:

- Use

      dataset.map(row -> { /* something */ });

  where "something" should validate each row and save the result somewhere (roughly what I have in mind is the first sketch after this list). But as the "something" block will be executed on the executors, I don't see how to get the results back to the driver, or where to store them so that they can be retrieved from the driver process.
- The second option is to use

      dataset.collect();

  but it will cause an OutOfMemoryError, since all the data would be loaded into the driver. We could use the take method, then remove the fetched subset from the dataset (with a filter) and repeat the operation (the second sketch after this list shows that loop), but I'm not comfortable with this approach.
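Here is roughly what I mean by the first option. validateRow is a hypothetical helper that applies the 40 rules to a row and returns one error string per row, and resultsPath is also an assumption; the map runs on the executors, so writing the output directly from there would avoid bringing everything back to the driver, but I'm not sure this is the right pattern:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Apply the row-level rules inside map(); validateRow is a hypothetical helper.
Dataset<String> validationResults = dataset.map(
        (MapFunction<Row, String>) row -> validateRow(row),
        Encoders.STRING());

// Persist the results from the executors instead of collecting them on the driver.
validationResults.write().mode(SaveMode.Overwrite).text(resultsPath);
```

And this is the take-then-filter workaround from the second option that I'm not comfortable with (the batch size and the unique "id" column are assumptions; the repeated count() and filter() passes over 600 million rows are exactly what worries me):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

int batchSize = 100_000;                                 // hypothetical batch size
Dataset<Row> remaining = dataset;
while (remaining.count() > 0) {                          // full scan of the remaining rows each pass
    List<Row> batch = remaining.takeAsList(batchSize);   // brings one batch back to the driver

    for (Row row : batch) {
        // validate the row on the driver and store the result
    }

    // Remove the already-processed rows before the next pass,
    // assuming a unique "id" column exists to identify them.
    Set<Object> processedIds = batch.stream()
            .map(r -> r.getAs("id"))
            .collect(Collectors.toSet());
    remaining = remaining.filter(
            (FilterFunction<Row>) r -> !processedIds.contains(r.getAs("id")));
}
```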
I was wondering if someone could suggest a robust method to deal with this kind of problem. Basically: keep Spark for the second stage of validation rules, and use Spark or another framework to ingest the file, execute the first set of validation rules and produce their results.
Thanks in advance for your help