
If I am copying a file and then comparing it back:

import shutil, filecmp

# dummy file names, they're not important
InFile = "d:\\Some\\Path\\File.ext"
CopyFile = "d:\\Some\\other\\Path\\File_Copy.ext"

# copy the file
shutil.copyfile(InFile, CopyFile)

# compare the two files byte-for-byte (shallow=False ignores os.stat shortcuts)
if not filecmp.cmp(InFile, CopyFile, shallow=False):
    print("File not copied correctly")

Why? It seems kind of pointless, doesn't it? After all, I've just copied the file, so it has to be identical, doesn't it? Wrong! Hard drives have an acceptable error rate that is very small but still present. The only way to be sure is to re-read the file, but as it has just been in memory, how can I be sure that the system (Windows 7) has actually read the file from the media and not just returned the pages from standby memory?

Let's assume that I have to write 16 TB of data to removable hard disc drives and I have to be sure that none of the files on the discs are corrupt, or at least no more corrupt than the live files. In 16 TB of disc space there are likely to be a few files that are not identical. I am currently using WinDiff to check the files byte-for-byte. That file comparison utility is slow, but at least I can be reasonably sure that it is actually reading the copied file from the disc, as the page should be long gone by then.

Can anybody offer an expert opinion, based on certainties, on which is likely to happen: read or remember?

It is suspicious that if I copy less than the installed memory, the verification is much quicker than the copy. It should be quicker, since reading is quicker than writing, but not that quick. If I copy 3 GB of files (I have 32 GB of installed memory) and it takes a minute, then verification should take 50 seconds or so and show 100% disc use in Resource Monitor. It doesn't: the verification takes less than 10 seconds and Resource Monitor doesn't budge. If I copy more than the installed memory, then verification takes almost as long as the copy and Resource Monitor shows 100%, which is what I'd expect. So what's happening here?
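The timing observation above can be turned into a rough check: time the read-back and compare the throughput with what the drive can physically sustain. This is only a minimal sketch, not a way to force an uncached read. `timed_read` and `throughput_mb_per_s` are hypothetical helper names, and the threshold you compare against is something you would have to measure on your own hardware.

```python
import time


def timed_read(path, chunk_size=1024 * 1024):
    """Read a file end to end and return (bytes_read, seconds_elapsed)."""
    start = time.time()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.time() - start


def throughput_mb_per_s(bytes_read, seconds):
    """Convert a (bytes, seconds) pair to MB/s; a zero interval counts as infinite."""
    if seconds <= 0:
        return float("inf")
    return bytes_read / (1024.0 * 1024.0) / seconds
```

If the figure for the verification pass is far above the sustained rate the drive manages during a large copy, the data most likely came from memory rather than from the media.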

For reference, the real code with error checking removed:

import shutil, filecmp, os, sys

FromFolder = sys.argv[1]
ToFolder   = sys.argv[2]

VerifyList = list()
VerifyToList = list()

BytesCopied = 0

if not os.path.exists(ToFolder):
    os.mkdir(ToFolder)

for (path, dirs, files) in os.walk(FromFolder):
    # mirror the source sub-folder under the destination;
    # relpath copes with a trailing separator on FromFolder
    RelPath = os.path.relpath(path, FromFolder)
    OutPath = os.path.normpath(os.path.join(ToFolder, RelPath))

    if not os.path.exists(OutPath):
        os.mkdir(OutPath)

    for thisFile in files:
        InFile = os.path.join(path, thisFile)
        CopyFile = os.path.join(OutPath, thisFile)

        ByteSize = os.path.getsize(InFile)
        # float divisors avoid Python 2 integer truncation in the reported size
        if ByteSize < 1024:
            RepSize = "%d bytes" % ByteSize
        elif ByteSize < 1048576:
            RepSize = "%.1f KB" % (ByteSize / 1024.0)
        elif ByteSize < 1073741824:
            RepSize = "%.1f MB" % (ByteSize / 1048576.0)
        else:
            RepSize = "%.1f GB" % (ByteSize / 1073741824.0)

        print("copy %s > %s " % (RepSize, thisFile))

        # remember both paths so the verify pass can run after all copies finish
        VerifyList.append(InFile)
        VerifyToList.append(CopyFile)

        shutil.copyfile(InFile, CopyFile)

# finished copying, now verify
FileIndex = range(len(VerifyList))
reVerifyList = list()
reVerifyToList = list()

for thisIndex in FileIndex:
    InFile = VerifyList[thisIndex]
    CopyFile = VerifyToList[thisIndex]

    thisFile = os.path.basename(InFile)
    ByteSize = os.path.getsize(InFile)

    # float divisors avoid Python 2 integer truncation in the reported size
    if ByteSize < 1024:
        RepSize = "%d bytes" % ByteSize
    elif ByteSize < 1048576:
        RepSize = "%.1f KB" % (ByteSize / 1024.0)
    elif ByteSize < 1073741824:
        RepSize = "%.1f MB" % (ByteSize / 1048576.0)
    else:
        RepSize = "%.1f GB" % (ByteSize / 1073741824.0)

    print("Verify %s > %s" % (RepSize, thisFile))

    if not filecmp.cmp(InFile, CopyFile, shallow=False):
        print("File not copied correctly " + thisFile)
        # copy, second chance
        reVerifyList.append(InFile)
        reVerifyToList.append(CopyFile)
        shutil.copyfile(InFile,CopyFile)

del VerifyList
del VerifyToList

if len(reVerifyList) > 0:
    FileIndex = range(len(reVerifyList))
    for thisIndex in FileIndex:
        InFile = reVerifyList[thisIndex]
        CopyFile = reVerifyToList[thisIndex]

        if not filecmp.cmp(InFile, CopyFile, shallow=False):
            thisFile = os.path.basename(InFile)
            print("File failed 2nd chance " + thisFile)
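The size-reporting branch appears twice in the script, once for the copy pass and once for the verify pass. As a possible tidy-up, it could live in one helper. `format_size` is a hypothetical name, not something from the original script, and it divides by floats so the value is not truncated under Python 2.

```python
def format_size(byte_size):
    """Return a human-readable size string using binary (1024-based) units."""
    if byte_size < 1024:
        return "%d bytes" % byte_size
    elif byte_size < 1024 ** 2:
        return "%.1f KB" % (byte_size / 1024.0)
    elif byte_size < 1024 ** 3:
        return "%.1f MB" % (byte_size / 1024.0 ** 2)
    return "%.1f GB" % (byte_size / 1024.0 ** 3)
```

Both loops could then call `format_size(os.path.getsize(InFile))` instead of repeating the if/elif chain.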
Michael Stimson
  • Well, seeing as your device doesn’t have 16TB of available memory… – Ry- Jun 26 '14 at 04:24
  • The array has 19 TB of HDD. What I'm asking is: if I copy and then compare a single file (smaller than the available memory), is it read from the disc or returned from a memory page? – Michael Stimson Jun 26 '14 at 04:26
  • You can never be sure, because some modern HDDs have internal buffers (SSDs) which do this transparently; there is no way for your OS to even recognize it... – OBu Jun 26 '14 at 04:28
  • If your data needs to be perfect, add checksums frequently. Read errors happen too. – Ry- Jun 26 '14 at 04:28
  • Good point @false, how would one go about that? And OBu, if the drive has 16 MB of cache and the file is smaller than that, then it may have made it to the drive but not to the media yet. – Michael Stimson Jun 26 '14 at 04:30
  • Correct - the only way to prevent this is to switch off the write cache for the drive, too. Some drives have (or had, I have not checked for a long time) jumper settings for such configurations. – OBu Jun 26 '14 at 04:37
  • @MichaelMiles-Stimson: I’m not terribly familiar with the various error-correction schemes – sorry – but that’s what you’ll probably want. https://en.wikipedia.org/wiki/Error-correcting_code – Ry- Jun 26 '14 at 04:40
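On the "how would one go about that?" question about checksums, one possible approach is to record a digest for each file at copy time and compare digests later. This is a minimal sketch using the standard `hashlib` module. The helper names `file_sha256`, `record_digests` and `find_mismatches` are made up for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_digests(paths):
    """Map each path to its current SHA-256 hex digest."""
    return dict((path, file_sha256(path)) for path in paths)


def find_mismatches(digests):
    """Return the paths whose current digest differs from the recorded one."""
    return [path for path, expected in sorted(digests.items())
            if file_sha256(path) != expected]
```

Note that hashing reads through the same OS cache as `filecmp.cmp`, so on its own it does not settle the read-or-remember question. What it buys is that the recorded digests can be checked later, for example the next day or on another machine, by reading only the copy.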

1 Answer


If you use an external hard drive, you can switch off the write cache for this drive.

But you can never be 100% sure, because some modern HDDs have internal buffers (SSDs) which do the buffering transparently; there is no way for your OS to even recognize it...
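Besides the drive setting, the copy itself can ask the OS to push its own write buffers out before verification starts. This is a minimal sketch, assuming a hand-rolled copy in place of `shutil.copyfile`; `copy_and_flush` is a hypothetical name. `os.fsync` only covers the OS-level buffers: it does not guarantee that the drive's internal cache has reached the media, and it does not evict the file from the read cache.

```python
import os
import shutil


def copy_and_flush(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst, then ask the OS to flush dst's buffers to the device."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
        fout.flush()             # empty Python's own buffer into the OS
        os.fsync(fout.fileno())  # ask the OS to write its buffers to the device
```

A later `filecmp.cmp(src, dst, shallow=False)` is still needed to compare the contents; the flush only narrows the window in which the data exists solely in memory.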

OBu
  • And a gut feeling: if you wait a bit and there is no heavy write load on your system, the OS and your drive should take the opportunity to write the file. So, e.g., first writing all files and afterwards comparing them should do it. But of course, this is just a gut feeling and therefore not part of the answer... – OBu Jun 26 '14 at 04:40
  • That is the basis for the current procedure: copy, then verify. If I start the copy before I go home and then start the verify the next day... The problem is that it takes soooo long to verify that amount of data. Gut feel is that if I copy all the files, keeping their paths in a list, and then verify the files from the start of the list, it would be safer. BTW, the example code is intended to go in an os.walk() and is not for copying *just one file*. – Michael Stimson Jun 26 '14 at 04:48