
I am trying to run two Python scripts as described below.

This code is designed to download the data files from their URLs:

years = ["2013","2014","2018","2019"] for year in years: code(year)

In this case, code is the downloading function. I want to download multiple years of data. On a normal computer, it takes about 26 hours to download each year.
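For illustration, here is a minimal, untested sketch of what I think running the per-year downloads concurrently with a thread pool might look like (downloading is I/O-bound, so threads should be enough); code(year) stands for my existing download function:

    # Rough sketch: run the per-year downloads concurrently.
    # code(year) is a stand-in for the existing download function.
    from concurrent.futures import ThreadPoolExecutor

    def code(year):
        ...  # actual download logic goes here

    years = ["2013", "2014", "2018", "2019"]

    with ThreadPoolExecutor(max_workers=len(years)) as executor:
        # One worker thread per year, so the downloads overlap
        # instead of running back to back.
        executor.map(code, years)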

This code is designed to process the data downloaded by the code above; the downloading must have finished before this code is executed:

years = ["2013","2014","2018","2019"] for year in years: data(year)

In this case, data is the data-processing function. I want to process multiple years of data. On a normal computer, it takes about 24 hours to process each year.
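Again for illustration, here is a minimal, untested sketch of what processing the years in parallel processes might look like (processing is CPU-bound, so separate processes avoid the GIL); data(year) stands for my existing processing function:

    # Rough sketch: process several years in parallel processes.
    # data(year) is a stand-in for the existing processing function.
    from concurrent.futures import ProcessPoolExecutor

    def data(year):
        ...  # actual processing logic goes here

    years = ["2013", "2014", "2018", "2019"]

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=len(years)) as executor:
            # One process per year; each can run on its own core.
            executor.map(data, years)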

I have access to a supercomputer that allows me to use 10 nodes with 36 cores each (360 cores in total), with the provision to run 4 jobs at a time for up to 24 hours.

I intend to run two jobs in a queue: the first downloads the data and the second processes it. I want to use multiple cores and nodes to minimize the execution time to download and process EACH year of data. I was informed that the use of multiple cores and nodes needs to be integrated into the code itself.
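One pattern I am considering (purely illustrative, not yet implemented) is to pass the year as a command-line argument, so that each job submitted to the queue can handle a different year on its own node:

    # Hypothetical sketch: each queued job runs e.g. "python download.py 2013",
    # so separate jobs/nodes can work on separate years without further code changes.
    import sys

    def code(year):
        ...  # stand-in for the download (or processing) function

    if __name__ == "__main__":
        year = sys.argv[1]  # the job script supplies the year
        code(year)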

I would really appreciate any suggestions on how I could minimize the execution time with the available resources, and how exactly I could implement that in the code. I looked into the multiprocessing library but wasn't quite able to implement it.

The data is downloaded from the link below: drive.google.com/open?id=1TdiPuIjj7u-arACMPh5yVeegcJ-y3fLr. The data for each year is about 6 GB. I believe the downloading takes so long only because the code has to check whether each URL is valid or not, and it goes through about 100,000 URLs per year. I was hoping that using the supercomputer would allow me to download all the years simultaneously, in the time it currently takes to download a single year. The downloading code: drive.google.com/open?id=1TdiPuIjj7u-arACMPh5yVeegcJ-y3fLr
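To show what I mean, here is a rough, untested sketch of how the URL checks for a single year might be done concurrently with threads; build_urls() is a placeholder for however the ~100,000 candidate URLs per year are generated, and the requests library is assumed:

    # Rough sketch: check/fetch many URLs for one year concurrently.
    # build_urls() is a placeholder; requests is assumed to be installed.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    def build_urls(year):
        ...  # placeholder: return the candidate URLs for that year
        return []

    def fetch(url):
        # A HEAD request is a cheap validity check; download only if the file exists.
        response = requests.head(url, timeout=10)
        if response.status_code == 200:
            content = requests.get(url, timeout=60).content
            ...  # save content to disk
        return url, response.status_code

    if __name__ == "__main__":
        urls = build_urls("2013")
        with ThreadPoolExecutor(max_workers=32) as executor:
            # Many URL checks in flight at once instead of one at a time.
            for url, status in executor.map(fetch, urls):
                pass  # failures could be logged here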

The data-processing code just converts the data to CSV files and then uses pandas to apply filters and thresholds. The code takes so long only because it processes a lot of files, about 100,000 per year. I was hoping to process all the years simultaneously on the supercomputer.
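Here is a rough, untested sketch of how I imagine the per-file processing for one year could be parallelised; process_file() and the directory name are placeholders for my actual conversion/filtering code and data layout:

    # Rough sketch: process one year's ~100,000 files with a process pool.
    # process_file() and the folder path are placeholders.
    import os
    from concurrent.futures import ProcessPoolExecutor

    def process_file(path):
        ...  # placeholder: convert to CSV, then apply pandas filters/thresholds

    def collect_files(year_dir):
        # Walk the year's sub-folders and gather every file path.
        paths = []
        for root, _dirs, files in os.walk(year_dir):
            paths.extend(os.path.join(root, name) for name in files)
        return paths

    if __name__ == "__main__":
        files = collect_files("data/2013")  # placeholder directory name
        with ProcessPoolExecutor(max_workers=36) as executor:  # e.g. one node's cores
            executor.map(process_file, files, chunksize=100)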

  • The link to the data is http://chain.physics.unb.ca/data/gps/ismr/ – chintan thakrar Feb 29 '20 at 17:38
  • Since downloading files is an I/O-intensive operation rather than a CPU-intensive one, I doubt you will see any benefit from downloading on a supercomputer unless it has better internet bandwidth; the processing is a completely CPU-intensive operation, though. Regarding your implementation, you are on the correct path with the multiprocessing library; I would suggest using a ProcessPoolExecutor to fully benefit from the multiple cores – syfluqs Feb 29 '20 at 17:57
  • @syfluqs Thank you for your response. Can I execute a single method call on more than one core? For example, is it possible to run data("2018") on, say, 3 cores? Would that help reduce the execution time? Could I implement this using methods of the ProcessPoolExecutor class? – chintan thakrar Feb 29 '20 at 20:08
  • CPython releases the GIL during most IO operations so you can use multithreading fairly easily to speed this up (if you're actually IO bound). Search for things like "python web scraping concurrent futures" or asyncio – Nick Becker Feb 29 '20 at 21:05
  • @chintanthakrar, as stated by Nick Becker above, multithreading will help you with I/O-bound ops, and asyncio might not be desirable for you since it generally only works in an event loop in a single-threaded environment. About your query on running a function on multiple cores: yes, it is possible, you can use a ProcessPoolExecutor with multiple workers and assign each of them to execute methods with a different parameter list. Also note that it is not necessary that all the processes in the pool will acquire different cores; that is up to the OS to assign based on several parameters. – syfluqs Mar 01 '20 at 00:09
  • @syfluqs thank you for your comment. I understand that I can assign each worker a different parameter. What I was concerned about is executing each worker on multiple cores. What do you think about the chunksize argument? What do you think is an appropriate number for the chunksize? Should I just use something like 10,000, or the number of cores available? – chintan thakrar Mar 01 '20 at 00:45
  • chunk size is the size of the chunks that your iterable will be divided into by the executor to feed into each process; this depends on your data (iterable) and how much of the data you want each process worker to handle at a time – syfluqs Mar 01 '20 at 00:49
  • @syfluqs, thanks for your professional opinion. So our data for each year is split up into 365 day folders, and each day folder is further split into 24 sub-folders. We set up loops to walk through each directory and process each file. Based on my understanding, I think it would be a good idea to set the chunksize to a really huge number so that more processors are used and each processor does less work, hence decreasing the execution time. What do you think? – chintan thakrar Mar 01 '20 at 00:57
  • the number of processors to be used is bound by `max_workers`; chunk size only controls how big a portion of the complete data is given to each of the workers to process – syfluqs Mar 01 '20 at 02:12
  • Oh I see how this works. In that case, what would you suggest I use for the max_workers argument? – chintan thakrar Mar 01 '20 at 03:04
  • @syfluqs, if I leave the value of max_workers unassigned, will it automatically detect the number of cores on the server and use that value, or is it important for me to state the value? – chintan thakrar Mar 01 '20 at 05:09
  • if not provided, `max_workers` defaults to the number of processor cores for a ProcessPoolExecutor (and to 5 × the core count for a ThreadPoolExecutor on older Python versions), but I reckon you will have to play with this number for a bit to find the sweet spot (a short sketch of how the two arguments fit together follows these comments) – syfluqs Mar 01 '20 at 06:14
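To summarise the discussion above for myself, here is a short, untested sketch of how I understand max_workers and chunksize fit together in ProcessPoolExecutor.map (the numbers are just examples):

    # Illustration of the max_workers / chunksize discussion above.
    from concurrent.futures import ProcessPoolExecutor

    def process_file(path):
        ...  # placeholder for the per-file processing

    if __name__ == "__main__":
        files = []  # the ~100,000 file paths for one year would go here

        # max_workers caps how many processes run at once (e.g. the 36 cores of one node).
        # chunksize controls how many items each worker receives per task:
        # bigger chunks mean less inter-process overhead, smaller chunks balance load better.
        with ProcessPoolExecutor(max_workers=36) as executor:
            executor.map(process_file, files, chunksize=500)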
