Currently, we have a data pipeline (written in Python) that takes about a week to compute in parallel and occupies about 4 TB on disk. We would like to investigate whether the data representation we chose is optimal, and how different architectures perform on different representations. To do that, we need to be able to change the representation and build models on the new data. However, producing a new representation means waiting 7 days (possibly longer, depending on the changes) before we can even begin training a model on it (3 days of training with the current model). Our current cycle is therefore about 10 days to generate the data and train the network, which is very expensive in both time and storage if we want to explore 50 different data representations.
We have therefore begun rewriting the entire data pipeline in C++ to speed up data generation. While this alleviates the time cost of generating the data, it does not solve the problem of storing that much data; we would much rather store only the metadata used to generate it.
Is it possible for us to generate the data and pass it straight to TensorFlow without ever writing it to disk?
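From what I can tell, something like `tf.data.Dataset.from_generator` might let us do this, streaming samples straight from memory into `model.fit`. Below is a minimal sketch of what we have in mind; the random arrays are just a stand-in for our C++ pipeline (which we would presumably expose to Python via something like pybind11), and the model is a throwaway placeholder:

```python
import numpy as np
import tensorflow as tf

def generate_samples():
    # Placeholder for our C++ pipeline (e.g. exposed to Python via pybind11).
    while True:
        x = np.random.rand(128, 128).astype(np.float32)  # one generated sample
        y = np.int32(np.random.randint(0, 10))           # its label
        yield x, y

# Samples stream straight from memory into training; nothing touches disk.
dataset = tf.data.Dataset.from_generator(
    generate_samples,
    output_signature=(
        tf.TensorSpec(shape=(128, 128), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(64).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dataset, steps_per_epoch=1000, epochs=3)  # infinite stream, so fix the steps
```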
The supercomputers we have access to have 16 CPUs and 4 GPUs per node, so we were wondering whether we could train the model on the GPUs while generating the next batch of data on the CPUs.
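Our hope is that the input pipeline can overlap the two stages: a parallel `map` would fan the per-sample generation out over the 16 CPUs, and `prefetch` would keep building batches while the GPUs train. A rough sketch of what we imagine, where `cpp_generate` is a hypothetical Python-callable wrapper around our C++ generator (here faked with NumPy):

```python
import tensorflow as tf

def cpp_generate(seed):
    # Hypothetical wrapper around the C++ generator: produce one sample from `seed`.
    import numpy as np
    rng = np.random.default_rng(int(seed))
    return rng.random((128, 128), dtype=np.float32), np.int32(rng.integers(0, 10))

def make_example(seed):
    # tf.numpy_function lets the (Python/C++) generator run inside the tf.data pipeline.
    x, y = tf.numpy_function(cpp_generate, [seed], (tf.float32, tf.int32))
    x.set_shape((128, 128))
    y.set_shape(())
    return x, y

dataset = (
    tf.data.Dataset.range(1_000_000)              # one "ticket" per sample to generate
    .map(make_example, num_parallel_calls=16)     # spread generation over the 16 CPUs
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)                   # build next batches while the GPUs train
)
```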
I have not been able to find anything on the internet that addresses this specific question, and I am wondering whether this functionality is built into TensorFlow or PyTorch. We are almost done rewriting the data pipeline in C++ and are a few weeks away from trying to access the data from TensorFlow. Again, our whole goal is to never write the data to disk.
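In PyTorch, my best guess is that the equivalent would be an `IterableDataset` streamed through a `DataLoader` with several worker processes; a minimal sketch, again with random arrays standing in for our generated samples:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class GeneratedStream(IterableDataset):
    """Streams samples from memory; the arrays stand in for the C++ generator's output."""
    def __iter__(self):
        while True:
            x = np.random.rand(128, 128).astype(np.float32)
            y = np.random.randint(0, 10)
            yield torch.from_numpy(x), torch.tensor(y)

# Each worker process keeps generating on the CPUs while the GPU consumes batches.
loader = DataLoader(GeneratedStream(), batch_size=64, num_workers=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for step, (xb, yb) in enumerate(loader):
    xb, yb = xb.to(device), yb.to(device)
    # forward/backward pass would go here
    if step == 10:
        break
```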