
With git I manage the changes to a Python script (script.py) and a set of tests; the tests use some text input data files, with this directory structure:

script.py
tests/
  test_01.py
  test_02.py
  data/
   data_file01
   data_file02
   ...

However, some of the input data files are starting to get very large (> 1 MB).

What is a good practice for managing test input data with git?

... maybe host them in online storage, but then how do I preserve and track changes to the input data files? (Suggestions?)

... or maybe use a library like setuptools to check whether the test input data exists and download it if missing, but again, how do I preserve and track changes to the input data files?
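
Independent of setuptools, the check-and-download logic I have in mind would look roughly like this (a minimal sketch; DATA_URL and EXPECTED_SHA256 are placeholders, not real values):

# Minimal sketch: fetch the archive if the data directory is missing
# and verify it with a SHA-256 checksum before using it.
# DATA_URL and EXPECTED_SHA256 are placeholders.
import hashlib
import urllib.request
from pathlib import Path

DATA_DIR = Path("tests/data")
DATA_URL = "https://example.com/data_test.7z"   # placeholder
EXPECTED_SHA256 = "..."                          # placeholder

def ensure_test_data():
    if DATA_DIR.exists():
        return
    archive = Path("data_test.7z")
    urllib.request.urlretrieve(DATA_URL, archive)
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA256:
        raise RuntimeError("test data checksum mismatch: " + digest)
    # extraction step omitted; e.g. run 7z via subprocess here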

EDIT

For now I back up the test data as a compressed file named after the corresponding commit, on a cloud disk (Dropbox, Google Drive, etc.), with these lines in the post-commit hook:

# post-commit hook: archive the test data in a 7z file named after the date and commit hash
commit_name=$(git rev-parse HEAD)
fecha=$(date +%Y%m%d)
7z a "$CLOUD_DISK/data_test/${fecha}_${commit_name}.7z" data/* -r

(I prefer 7z over zip because it produces a smaller compressed file.)

The $CLOUD_DISK variable is defined in .bashrc.

EDIT 2

I have started working on a more complete solution to my problem:

https://github.com/juanpabloaj/gitdata

JuanPablo

1 Answer


I'd keep the data in your repo. You're right that you need to track changes to the input data in case they introduce problems. Otherwise, perhaps create a hash of the data, like a checksum?
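
For example, a manifest of hashes could be generated and committed in place of the large files themselves. A rough sketch (the name checksums.txt is just an illustration):

# Rough sketch: write a SHA-256 manifest of everything under tests/data,
# so changes to the large files can be detected even when the files
# themselves are stored outside the repository.
import hashlib
from pathlib import Path

DATA_DIR = Path("tests/data")

with open("checksums.txt", "w") as manifest:
    for path in sorted(DATA_DIR.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest.write(digest + "  " + str(path) + "\n")

Committing checksums.txt keeps a record of which version of the data each commit expects, even if the data itself lives elsewhere.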

Graeme Stuart
  • for now I save the test data on a cloud disk (Dropbox, Google Drive, etc.): `7z a $CLOUD_DISK"/data_test/$fecha"_"$commit_name".7z data/* -r` ... maybe it is necessary to add the checksum to the 7z file name. – JuanPablo Feb 26 '14 at 15:19