I am new to DVC, and so far I like what I see. My question may therefore be fairly easy to answer.
My question: how do we correctly track dependencies on files in an original huge data repo (let's assume this repo can also change) from a derivedData project, but WITHOUT the huge files generally being pulled in when the derived data is checked out? I don't think I can use dvc import to achieve this.
Details: We have a repository with a large number of quite big data files (scans) and use this data to design and train various algorithms. Often we want to use only specific files, and even only small chunks from within those files, for training, annotation and so on. That is, we derive data for specific tasks, which we want to put in new repositories.
Currently my idea is to dvc get
the relevant data, put it in an untracked temporary folder, and then manage the derived data with DVC again, but still record the dependency on the original data. Concretely, with hugeFileRepo laid out like this:
hugeFileRepo
+-- metaData.csv
+-- dataFolder
    +-- hugeFile_1
    ...
    +-- hugeFile_n
In the derivedData repository I do:
dvc import hugeFileRepo.git metaData.csv

dvc run -f derivedData.dvc \
    -d metaData.csv \
    -d deriveData.py \
    -o derivedDataFolder \
    python deriveData.py
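
For reference, the metaData.csv.dvc stub that dvc import writes already records where the data came from. With the DVC version I am using it looks roughly like this (URL and hashes abbreviated; the exact fields may differ between DVC versions, but the important part is the rev_lock entry that pins the hugeFileRepo commit):

deps:
- path: metaData.csv
  repo:
    url: .../hugeFileRepo.git
    rev_lock: <commit of hugeFileRepo that was imported>
outs:
- md5: ...
  path: metaData.csv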
My deriveData.py does something along these lines (pseudocode):
metaData = read("metaData.csv")

# Hack because I don't know how to do it right:
gitRevision = getGitRevision("metaData.csv.dvc")
...

for metaDataForFile, file in metaData:
    if iWantFile(metaDataForFile):
        # download only this specific huge file
        !dvc get --rev {gitRevision} -o tempFolder/{file} hugeFileRepo.git {file}
        # process the huge file and store the result in derivedDataFolder
        processAndWrite(f"tempFolder/{file}")
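
The getGitRevision hack just reads that pinned revision back out of the import stub. A minimal sketch of it, assuming the stub layout shown above and that PyYAML is installed (the function name is my own):

import yaml

def getGitRevision(dvc_file):
    # Read the .dvc stub written by `dvc import` and return the commit
    # of hugeFileRepo (rev_lock) that metaData.csv was imported from.
    with open(dvc_file) as f:
        stub = yaml.safe_load(f)
    return stub["deps"][0]["repo"]["rev_lock"]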
So I use the metaData file as a proxy for the actual data. The hugeFileRepo data will not change frequently, and the metaData file will be kept up to date. I am also absolutely fine with having a dependency on the data in general rather than on the actual files I used. So I believe this solution would work for me, but I am sure there is a better way.