I am trying to run two python scripts as described below.
This code is designed to download the data, file by file, from specific URLs.

    years = ["2013", "2014", "2018", "2019"]
    for year in years:
        code(year)
In this case, code is the downloading function. I want to download multiple years of data; on a normal computer it takes about 26 hours to download each year.
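To make this concrete, here is a rough sketch of what I imagine the parallel version of this loop would look like, assuming code can be imported from a module (the module name download is just a placeholder, not my actual file layout):

    from multiprocessing import Pool

    from download import code  # placeholder module name for my downloading script

    years = ["2013", "2014", "2018", "2019"]

    if __name__ == "__main__":
        # One worker process per year, so the four downloads run side by side
        # instead of one after another.
        with Pool(processes=len(years)) as pool:
            pool.map(code, years)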
This code is designed to process the data downloaded by the code above, so the downloading code must have finished before this one runs.

    years = ["2013", "2014", "2018", "2019"]
    for year in years:
        data(year)
In this case, data is the data processing function. I want to process multiple years of data; on a normal computer it takes about 24 hours to process each year.
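The same idea presumably applies to the processing loop; this sketch uses concurrent.futures instead of multiprocessing.Pool just to show the alternative (again, the module name process is only a placeholder):

    from concurrent.futures import ProcessPoolExecutor

    from process import data  # placeholder module name for my processing script

    years = ["2013", "2014", "2018", "2019"]

    if __name__ == "__main__":
        # Each year is handed to its own worker process; the with-block
        # waits for all of them to finish.
        with ProcessPoolExecutor(max_workers=len(years)) as executor:
            list(executor.map(data, years))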
I have access to a supercomputer that allows me to use 10 nodes with 36 cores each (360 cores in total), with the provision to run 4 jobs at a time for up to 24 hours each.
I intend to run two jobs in a queue: the first downloads the data and the second processes it. I want to use multiple cores and nodes to minimize the execution time for downloading and processing EACH year of data. I was told that the use of multiple cores and nodes has to be integrated into the actual code.
I would really appreciate any suggestions on how to minimize the execution time with the available resources and how exactly to implement that in the code. I looked into the multiprocessing library but couldn't quite get it working.
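To show where I got stuck, this is roughly the kind of thing I was trying with multiprocessing inside the download step for a single year (download_one_url and the URL list are placeholders, not my real code):

    from multiprocessing import Pool, cpu_count

    def download_one_url(url):
        # Placeholder for the part of my real code that checks one URL and,
        # if it is valid, downloads that one file.
        print("would fetch", url)

    def code(year):
        # Placeholder URL list; my real code builds roughly 100,000 URLs per year.
        urls = ["http://example.com/%s/file_%05d" % (year, i) for i in range(10)]
        # Spread this one year's URLs over all the cores of a single node
        # instead of looping over them one at a time.
        with Pool(processes=cpu_count()) as pool:
            pool.map(download_one_url, urls)

    if __name__ == "__main__":
        code("2013")

Is this the right direction, and how would I extend it beyond a single node?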
The data is downloaded from the link below: drive.google.com/open?id=1TdiPuIjj7u-arACMPh5yVeegcJ-y3fLr. The data for each year is about 6 GB. I believe the downloading takes so long mainly because the code has to check whether each URL is valid, and it goes through about 100,000 URLs per year. I was hoping that using the supercomputer would let me download all the years simultaneously in roughly the time it currently takes to download a single year. Downloading code: drive.google.com/open?id=1TdiPuIjj7u-arACMPh5yVeegcJ-y3fLr
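If the URL-validity check really is the bottleneck, I imagine it could be done with many threads at once, since it is network-bound rather than CPU-bound. Something like the following is what I have in mind (requests and url_is_valid are my guesses, not what my actual code uses):

    from concurrent.futures import ThreadPoolExecutor

    import requests  # assumption -- my real checking code may use a different library

    def url_is_valid(url):
        # A HEAD request only asks whether the file exists; it does not
        # download the file itself.
        try:
            return requests.head(url, timeout=10).status_code == 200
        except requests.RequestException:
            return False

    def find_valid_urls(urls):
        # Checking URLs is network-bound, so many threads can wait on
        # responses at the same time.
        with ThreadPoolExecutor(max_workers=64) as executor:
            flags = list(executor.map(url_is_valid, urls))
        return [url for url, ok in zip(urls, flags) if ok]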
The data processing code simply converts the downloaded data to CSV files and then uses pandas to apply filters and thresholds. It takes so long only because it processes a lot of files, about 100,000 per year. I was hoping to process all the years simultaneously on the supercomputer.
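For the processing side, my rough idea is to hand the individual files of one year to a pool of workers; everything here (the directory layout, the column name, the threshold, process_one_file itself) is a placeholder to show the structure, not my actual processing logic:

    import glob
    from multiprocessing import Pool, cpu_count

    import pandas as pd

    def process_one_file(path):
        # Placeholder for my per-file step: read, filter, write out a CSV.
        df = pd.read_csv(path)            # my real files may need a different reader
        df = df[df["value"] > 0]          # placeholder filter/threshold
        df.to_csv(path + ".filtered.csv", index=False)

    def data(year):
        files = glob.glob("data/%s/*" % year)   # placeholder directory layout
        # Hand this year's ~100,000 files to a pool of workers, one file per
        # task, so all cores on the node stay busy.
        with Pool(processes=cpu_count()) as pool:
            pool.map(process_one_file, files)

    if __name__ == "__main__":
        data("2013")

Would this, combined with running each year as a separate job, be a sensible way to use the 4 jobs and multiple nodes I have available?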