
I am constructing a script that will process well over 250 GB of data from a single PostgreSQL table. The table is roughly 150 columns by 74 million rows. My goal is to sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed, I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:

  • How can I ensure that each cell entry meets the criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc. (see the sketch after this list).
  • What tools/languages are best for handling data at this scale? I use Python and the Pandas module often for DataFrame manipulation, and I am aware of the read_sql function, but I suspect this much data will simply take too long to process in Python.
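For illustration, here is a minimal sketch of the kind of per-column check described above, pushing each criterion into PostgreSQL as a SQL predicate so that only violation counts come back to Python. The table name, column names, regex patterns, and connection string are placeholders, and the regex checks assume the columns in question are stored as text:

```python
# Minimal sketch: express each column's criterion as a SQL predicate that
# flags a *bad* value, and let PostgreSQL do the scanning.
import psycopg2  # assumes psycopg2 is available

# Placeholder checks: column -> predicate matching rows that FAIL the rule.
checks = {
    "date_col": r"date_col !~ '^\d{4}-\d{2}-\d{2}$'",  # not yyyy-mm-dd
    "amount":   "amount <= 0",                         # must be positive
    "code":     r"code !~ '^[a-z]+'",                  # must start with a-z
}

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn, conn.cursor() as cur:
    for col, predicate in checks.items():
        # Predicates are trusted constants defined above, so plain string
        # formatting is acceptable here.
        cur.execute(f"SELECT count(*) FROM my_table WHERE {predicate}")
        bad = cur.fetchone()[0]
        print(f"{col}: {bad} rows fail their check")
```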

I know how to process the data chunk-by-chunk in Python manually, but I think this is probably too inefficient, and the script could take well over 12 hours.
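For comparison, a minimal sketch of the chunk-by-chunk pandas approach mentioned above; the connection string, table, column, and chunk size are placeholders, and `read_sql` streams the result set when given a `chunksize`:

```python
# Minimal sketch of chunked validation in pandas, for comparison with
# doing the checks inside PostgreSQL. All names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://me@localhost/mydb")  # placeholder DSN

bad_dates = 0
query = "SELECT id, date_col FROM my_table"
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    # Count values that are not in yyyy-mm-dd form within this chunk.
    ok = chunk["date_col"].astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$")
    bad_dates += int((~ok).sum())

print(f"{bad_dates} rows fail the date format check")
```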

Simply put (TL;DR): I'm looking for a simple, streamlined solution for manipulating and performing QC analysis on PostgreSQL data.

Tom Hood
  • 74 million rows is not considered "big data" these days. –  May 28 '19 at 14:13
  • `date` columns do not have "a format"; they are stored in a binary form and formatted only when being _displayed_. There is no need to "ensure" a format for a `date` (or `timestamp`) column. –  May 28 '19 at 14:14
  • Sure, it’s not in PB or TB, but I will still need to implement big data practices to have an efficient solution. Simply iterating over millions of data points will not work. – Tom Hood May 28 '19 at 14:15
  • To validate the "format" of e.g. a text or varchar column you could apply a regex. And yes, iterating over millions of rows in SQL **will** work. –  May 28 '19 at 14:15
  • @a_horse_with_no_name I offered the date example as... an example. There are many other checks that I need to perform that will likely be string comparisons. Maybe instead of offering small corrections to my question, you could offer a solution? – Tom Hood May 28 '19 at 14:17
  • As I said: use a regex or any other valid SQL expression, e.g. to check that all values are greater than zero, use `select count(*) from the_table where some_column > 0`, or to make sure that a column always starts with a character use `where col ~ '^[a-z]+'` - but without some examples of what kind of checks you **really** want to do this is impossible to answer (a single-pass version of this idea is sketched after these comments). –  May 28 '19 at 14:20
  • @a_horse_with_no_name I see. If I want the shortest possible run time, are SQL commands going to be the fastest option for these checks? I will probably only need to use regex. In terms of re-usability what are my options? Just export the queries I run to a file? – Tom Hood May 28 '19 at 14:43
  • @TomHood, it's hard to offer a solution when the question is so vague. We can't tell, for example, whether this is something where simply adding constraints to the columns will work, or indexes for your speed concerns. – jmelesky May 28 '19 at 14:43
  • @TomHood Is this a one-time job or a repeating workload? Does the table change substantially or incrementally between runs? What's your target runtime, and why that? What kind of compute resources do you have? What's your target parallelism, and why? Are all 150 columns "text", or do you have numeric or other types? – jmelesky May 28 '19 at 19:20
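Expanding on the SQL-expression suggestion in the comments above, here is a minimal sketch of folding several checks into a single pass over the table with PostgreSQL's `FILTER` clause, so the 74 million rows are scanned only once. Column names, predicates, and the connection string are placeholders:

```python
# Minimal sketch: combine all checks into one table scan using FILTER clauses.
# Column names, predicates, and the DSN are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(r"""
        SELECT
            count(*) FILTER (WHERE date_col !~ '^\d{4}-\d{2}-\d{2}$') AS bad_dates,
            count(*) FILTER (WHERE amount <= 0)                       AS bad_amounts,
            count(*) FILTER (WHERE code !~ '^[a-z]+')                 AS bad_codes
        FROM my_table
    """)
    bad_dates, bad_amounts, bad_codes = cur.fetchone()
    print(f"bad dates: {bad_dates}, bad amounts: {bad_amounts}, bad codes: {bad_codes}")
```

For reusability, a query like this could live in a version-controlled .sql file and be re-run as needed, along the lines of the "export the queries to a file" idea raised in the comments.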

0 Answers