I am building a script that will process well over 250 GB of data from a single PostgreSQL table. The table is roughly 150 columns by 74M rows. My goal is to sift through all of the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed, I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:
- How can I ensure that each cell entry meets the criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', and so on (see the sketch after this list).
- What tools/languages are best for handling this much data? I use Python and the Pandas module often for DataFrame manipulation, and I am aware of the `read_sql` function, but I think this much data will simply take too long to process in Python.
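To make the first bullet concrete, here is a minimal sketch of what I mean by per-column criteria: each column maps to a check that every entry must satisfy. The column names and rules shown are placeholders for the real ones:

```python
# Sketch of per-column criteria: each column name maps to a function that
# returns a boolean Series flagging which entries satisfy the rule.
# Column names and rules below are placeholders.
import pandas as pd

CRITERIA = {
    # 'Date' entries must look like yyyy-mm-dd
    "Date": lambda s: s.astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$"),
    # 'Amount' entries must be non-negative numbers
    "Amount": lambda s: pd.to_numeric(s, errors="coerce").ge(0),
}

def violations(df: pd.DataFrame) -> dict[str, int]:
    """Count entries in each checked column that fail their rule."""
    return {col: int((~rule(df[col])).sum()) for col, rule in CRITERIA.items()}

# Tiny in-memory example so the sketch runs on its own:
demo = pd.DataFrame({"Date": ["2024-01-31", "31/01/2024"], "Amount": [10, -5]})
print(violations(demo))  # {'Date': 1, 'Amount': 1}
```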
I know how to manually process the data chunk-by-chunk in Python, but I suspect that this is too inefficient and that the script could take well over 12 hours.
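For reference, this is roughly what I mean by manual chunk-by-chunk processing (a rough sketch only; the connection string, query, chunk size, and column name are placeholders):

```python
# Rough sketch of the chunk-by-chunk approach: with chunksize set,
# pandas.read_sql yields DataFrames one at a time instead of loading all
# 74M rows into memory, and each rule runs per chunk.
# The connection string, query, chunk size, and column name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")
DATE_RE = r"^\d{4}-\d{2}-\d{2}$"

bad_dates = 0
for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=100_000):
    # flag 'Date' entries that are not in yyyy-mm-dd form
    bad_dates += int((~chunk["Date"].astype(str).str.match(DATE_RE)).sum())

print(f"{bad_dates} rows violate the yyyy-mm-dd format")
```

Even with chunking, every one of the 74M rows still has to pass through Python, which is why I suspect this will be far too slow.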
Simply put (TL;DR): I'm looking for a simple, streamlined solution for manipulating and performing QC analysis on PostgreSQL data.