
I have data stored in folders across several computers. Many of the folders contain 40-100 GB of files ranging in size from 500 KB to 125 MB. There are some 4 TB of files which I need to archive, and I need to build a unified metadata system based on the metadata stored on each computer.

All systems run Linux, and we want to use Python. What is the best way to copy the files and archive them?

We already have programs to analyze the files and fill the metadata tables, and they all run in Python. What we need to figure out is a way to copy files without data loss, and to verify that the files have been copied successfully.

We have considered using rsync and unison, launched via subprocess.Popen, but they are essentially sync utilities, whereas this is a copy-once job that simply has to copy correctly. Once the files are copied, the users will move to the new storage system.

My worries are: 1) there must be no corruption when the files are copied; 2) the copying should be efficient, though there are no hard speed requirements. The LAN is 10/100, with the ports being Gigabit.

Are there any scripts that can be incorporated, or any other suggestions? All computers will have SSH keys set up, so we can connect without passwords.

The directory structure, which is very similar to that of the old computers, would be maintained on the new server.

ivan_pozdeev
ramdaz

  • Is there a problem with using a sync utility to copy? – zmccord Mar 08 '12 at 13:51
  • No, not really, but is using rsync or unison inside Python the best recommended way? We need to run the entire process using Python, since there is a metadata-generating program that updates multiple tables in a database. – ramdaz Mar 08 '12 at 13:55

3 Answers


I would look at the Python Fabric library. It streamlines the use of SSH, and if you are concerned about data integrity, I would use SHA-1 or another hash algorithm to create a fingerprint for each file before transfer, then compare the fingerprints generated at the source and at the final destination. All of this can be done with Fabric.
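
For illustration, here is a minimal sketch along those lines, assuming Fabric 1.x (the fabric.api module); the host name and paths are placeholders:

    import hashlib
    from fabric.api import env, put, run

    env.host_string = "user@newserver"  # placeholder destination host

    def sha1_of(path, chunk_size=1 << 20):
        # Hash the file in 1 MB chunks so multi-GB files fit in memory.
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def copy_and_verify(local_path, remote_path):
        local_sum = sha1_of(local_path)
        put(local_path, remote_path)
        # sha1sum prints "<hash>  <path>"; keep only the hash field.
        remote_sum = run("sha1sum '%s'" % remote_path).split()[0]
        if local_sum != remote_sum:
            raise IOError("checksum mismatch for %s" % local_path)

Comparing the two fingerprints catches both transfer corruption and truncated copies.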

snarkyname77

If more seamless Python integration is the goal, you can look at the following (see the Duplicity sketch below):

  • Duplicity
  • pyrsync
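
For example, Duplicity is a command-line tool, so one way to drive it from Python is through subprocess; a minimal sketch, with a placeholder source directory and target URL:

    import subprocess

    # Back up a local tree to the new server over SFTP, checking the
    # exit code so a failed transfer is not silently ignored.
    subprocess.check_call([
        "duplicity",
        "--no-encryption",                     # plain archival copy, no GPG
        "/data/source",                        # placeholder source directory
        "sftp://user@newserver//data/archive", # placeholder target URL
    ])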

Joao Figueiredo

I think rsync is the solution. If you are concerned about data integrity, look at the explanation of the "--checksum" parameter in the man page.

Other arguments that might come in handy are "--delete" and "--archive". Make sure the exit code of the command is checked properly.
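
For instance, a minimal sketch of driving rsync from Python and checking the exit code; the paths and host are placeholders:

    import subprocess

    cmd = [
        "rsync",
        "--archive",   # recurse and preserve permissions, times, symlinks
        "--checksum",  # compare files by checksum rather than size and mtime
        "-e", "ssh",   # use the passwordless SSH connections already set up
        "/data/source/",
        "user@newserver:/data/archive/",
    ]

    ret = subprocess.call(cmd)
    if ret != 0:
        raise RuntimeError("rsync exited with code %d" % ret)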

maxy