
I am currently trying to write code to check the data quality of a 7 GB data file. I tried googling for exactly this, but to no avail. Initially, the purpose of the code is to count how many values are nulls/NaNs; later on, I want to join it with another data file and compare the quality of the two. We expect the second file to be the more reliable one, but I would like to automate the whole process eventually. I was wondering if someone here would be willing to share their data quality Python code using Dask. Thank you.

doubleD
  • Could you add details and what you've tried? – D Malan Feb 21 '22 at 08:25
  • 1
  • Could you please share some pseudo-code or a [minimal example](https://stackoverflow.com/help/minimal-reproducible-example)? It'll allow us to help you better. The best way will depend on your personal setup/workflow. From David's answer below, starting with pandas is a good idea, but I'd suggest using [Dask DataFrame's API](https://docs.dask.org/en/stable/dataframe-api.html) to produce a Dask equivalent, instead of `map_partitions` (which can be the final option). You may also consider posting this on Discourse since it's more of a discussion topic: https://dask.discourse.group/ :) – pavithraes Feb 21 '22 at 14:55
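
For the null/NaN count described in the question, the Dask DataFrame API that the comment above recommends can be used much like pandas. A minimal sketch, assuming a hypothetical CSV file name:

```python
import dask.dataframe as dd

# Hypothetical file name and format; swap in the real 7 GB file.
ddf = dd.read_csv("big_file.csv")

# isna().sum() builds a lazy task graph over all partitions;
# compute() runs it in parallel and returns a pandas Series of
# null/NaN counts per column.
null_counts = ddf.isna().sum().compute()
print(null_counts)
```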

1 Answer


I would suggest the following approach:

  • try to define how you would check quality on a small dataset and implement it in pandas
  • try to generalize the process so that if each "part of the file" (partition) is of good quality, then the whole dataset can be considered of good quality.
  • use Dask's map_partitions to parallelize this processing over your dataset's partitions (see the sketch after this list).
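
A minimal sketch of this approach, assuming a hypothetical CSV file and using a per-partition null/NaN count as the quality check:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file name; replace with the real 7 GB file.
ddf = dd.read_csv("big_file.csv")

# Step 1: define the quality check against a plain pandas DataFrame.
def null_report(partition: pd.DataFrame) -> pd.DataFrame:
    # One row per partition, one column per original column,
    # holding the count of null/NaN values in that partition.
    return partition.isna().sum().to_frame().T

# Steps 2-3: if every partition's report is acceptable, the whole
# dataset can be considered acceptable; map_partitions runs the
# check on each partition in parallel.
per_partition = ddf.map_partitions(null_report)

# Aggregate the per-partition reports into totals for the file.
total_nulls = per_partition.sum().compute()
print(total_nulls)
```

Once the pandas version of the check works on a sample, the same `map_partitions` call can wrap a richer check (duplicates, out-of-range values, schema differences) before the two files are joined and compared.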
David Gruzman