
Currently, we have a data pipeline (written in Python) that takes about a week to compute in parallel and occupies about 4 TB. We would like to research whether the data representation we chose is optimal, and how different architectures perform on different data representations. Thus, we would like to be able to change the data representation and build models on the new representations. However, generating a new representation means waiting 7 days, possibly longer depending on the changes, before we can even begin training a model on the data (3 days of training with the current model). Our current cycle therefore requires 10 days to generate the data and train the network, which is very expensive in both time and space if we want to explore 50 different data representations.

Thus, we have begun rewriting the entire data pipeline in C++ to speed up the data generation. While this alleviates the time cost of generating the data, it does not solve our issue of storing that much data. We would much rather save only the metadata used to generate the data.

Is it possible for us to generate the data and pass it straight to tensorflow without ever writing it to disk?

The supercomputers we have access to have 16 CPUs and 4 GPUs per node. So we were wondering if we could train a model on the GPUs and generate the next batch of data on the CPUs while the net is training.
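To make the overlap we have in mind concrete, here is a minimal standard-library sketch of the producer/consumer pattern we are picturing: one thread fills a bounded in-RAM queue with freshly generated batches while the main thread consumes them. `generate_batch` and `train_step` are placeholders for our C++ generation code (e.g. called through bindings) and the GPU training step:

```python
import queue
import threading

def generate_batch(i):
    # Placeholder for the C++ data generation (e.g. exposed via pybind11).
    return [i * 3 + j for j in range(3)]

def train_step(batch):
    # Placeholder for one GPU training step; returns a loss-like value.
    return sum(batch)

NUM_BATCHES = 5
batch_queue = queue.Queue(maxsize=8)  # bounded, so RAM usage stays capped

def producer():
    for i in range(NUM_BATCHES):
        batch_queue.put(generate_batch(i))  # blocks when the queue is full
    batch_queue.put(None)  # sentinel: no more data

def consumer(results):
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        results.append(train_step(batch))

results = []
producer_thread = threading.Thread(target=producer)
producer_thread.start()
consumer(results)
producer_thread.join()
print(results)  # → [3, 12, 21, 30, 39]
```

The bounded queue is the key design choice: `put` blocks once the queue is full, so generation can run at most `maxsize` batches ahead of training and never exhausts the 192 GB of RAM per node.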

I have not been able to find anything on the internet that addresses this specific question. Is this functionality built into TensorFlow or PyTorch? We are almost done rewriting the data pipeline in C++ and are a few weeks away from trying to access the data from TensorFlow. Again, our whole goal is to never write the data to disk.
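From reading the `tf.data` documentation, something like the following looks like it might be the built-in path we are hoping for: `Dataset.from_generator` wraps a Python generator so batches go straight from memory into training, and `prefetch` overlaps generation on the CPU with training on the GPU. The generator below is a NumPy stand-in for our C++ pipeline (which we would call through bindings), so the shapes and values are placeholders:

```python
import numpy as np
import tensorflow as tf

def batch_generator():
    # Stand-in for the C++ pipeline: yields (features, labels) batches
    # directly from memory, never touching disk.
    for i in range(4):
        features = np.full((8, 16), float(i), dtype=np.float32)
        labels = np.zeros((8,), dtype=np.float32)
        yield features, labels

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(8, 16), dtype=tf.float32),
        tf.TensorSpec(shape=(8,), dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)  # generate the next batch while the GPU trains

# model.fit(dataset) or a custom training loop would consume this directly.
for features, labels in dataset:
    pass
```

If this works the way the docs suggest, the dataset could be handed to `model.fit` as-is, and the only remaining question is how cheaply we can cross the C++/Python boundary per batch.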

Danny Diaz
  • I don't have your code so I can't tell exactly why writing to disk is necessary, but would something like [this](https://medium.com/@ruhshan_ahmed/how-to-train-with-tensorflow-without-hurting-ram-dd617038872) work? – user202729 Aug 29 '19 at 00:43
  • So that suggests using a generator that reads data one item at a time from a file. We don't have a file to read from. All our data is going to be residing in a queue in RAM (192 GB of RAM per node) that the neural net can feed from while we keep adding to the back of the queue. – Danny Diaz Aug 29 '19 at 00:48
  • If you switch python to C++ on all heavy labor functions, you'll get a speed up from 1 week to under a day (unless you already use C++ for most CPU intensive tasks). You don't need to change most of the code, you only need to make C++ based DLLs for the heavy stuff. – ALX23z Aug 29 '19 at 02:04
  • Yeah so that's what we are in the middle of doing. However, how do I get around not saving any of the data but yet still training? – Danny Diaz Aug 29 '19 at 02:05
  • @DannyDiaz I am not familiar with TensorFlow, but it sounds odd to me that you would need to save the data to disk to process it on the GPU. You'll either need to implement the code yourself (copy, paste, and adapt) or check out their API; they ought to have better ways to work with the data. – ALX23z Aug 29 '19 at 02:15
  • I can't imagine that I am the first person wanting to pass real time computed data to tensorflow for training. I just have not found anything online where someone does this. I will have to spend more time carefully looking through the API. Writing this myself in CUDA/C++ will be a last resort. – Danny Diaz Aug 29 '19 at 02:47

0 Answers