-1

We've created a folder on my dad's computer for everyone in the family to deposit and share their photos and videos.

Example of directories:
/Family_Photo/Penguins/2017 09 02/
/Family_Photo/East Beach/2017 10 11/Seaside/
/Family_Photo/East Beach/2017 10 11/Games/

Using md5deep, I am able to create a complete list of checksums for all the files in all subdirectories:

md5deep -r /Family_Photo/ > /Family_Photo/md5sum.log

Instead of regenerating the complete list of MD5 checksums for all files (newly added and existing) every time, how can I write a bash script that automatically detects any files that have not been hashed before, generates checksums for only these new files, and appends them to the original md5sum.log?

Mich
  • What have you tried so far? Show us some code. – Maxime Chéramy Jun 03 '18 at 09:53
  • I don't know if md5deep can do that. If not, you could use `find` to filter files based on modification date, then md5 those and somehow patch the main md5 list. If you really want to append only previously non-existent files (and ignore changes), then you'd need to filter based on the names in the existing list. – Jay Jun 03 '18 at 10:10
  • Traversing the directory tree to find the files is the expensive part of the operation; recalculating the checksum for files you already have a checksum for is negligible by comparison, and thus an unnecessary optimization. – tripleee Jun 03 '18 at 11:01
  • I'm intrigued as to *why* anyone would want to md5 their photos? – Mark Setchell Jun 03 '18 at 11:55

3 Answers

1

Solution

This should do the trick:

comm -1 -3 <(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) <(find /Family_Photo -type f | sort) | xargs --delimiter='\n' --no-run-if-empty md5deep | tee --append /Family_Photo/md5sum.log

Notes

  • If you use a different path than the one in the example, make sure to use an absolute and canonical path or append the option -exec realpath {} \; to find, because md5deep seems to write such paths into the file and we need them to be identical for comparison.
  • This command line uses bash-specific syntax (process substitution, i.e. passing the output of commands as files) and might not work in other shell interpreters.

Explanation

  • comm -1 -3
    • We use this command in this specific case to see which files are new by comparing found files to the existing list.
    • comm compares two sorted lists and outputs which lines are unique to each and which are common to both
    • -1 means: don't show lines unique to first list
    • -3 means: don't show lines common to both files
    • as a result we only output lines unique to second list
  • <(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) As the first file to comm, we pass a list of the already hashed file names.
    • <(...) is bash syntax to pass the result of a program as file argument
    • With grep we extract the file names from the existing file by matching whatever follows double-space
    • --text makes sure md5sum.log is always considered a text file and not skipped
    • --perl-regex use perl regular expression syntax (we need this for look-behind matching)
    • --only-matching only output text that matched the pattern, not the entire line with the match
    • '(?<=  ).+' the matching pattern: (?<=  ) is a "look-behind" pattern that checks whether the match is preceded by two spaces; it is followed by .+ (one or more of any character)
    • | sort we pass the output of grep to sort, because comm expects sorted lists
  • <(find /Family_Photo -type f | sort) As second file to comm we pass all files we find
    • <(...) is bash syntax to pass the result of a program as file
    • find will recurse a given directory and print out all file names
    • -type f instructs find to only output the names of found files, not directories
    • | sort we pass the output of find to sort, because comm expects sorted lists
  • | xargs --delimiter='\n' --no-run-if-empty md5deep The resulting list of new files is passed to md5deep
    • | connects the output of comm to the input of xargs
    • xargs will call a command (in this case md5deep) with whatever comes as input as argument
    • --delimiter='\n' specifies a newline as the separator, so that other whitespace in file names won't get mistaken for the start of a new argument
    • --no-run-if-empty we don't want to run md5deep if we don't have a single new filename to pass to it.
  • | tee --append /Family_Photo/md5sum.log The resulting list of hashes is written to the hash file
    • This displays the new files/hashes for your convenience while writing them; if you don't want to see them, just use >> /Family_Photo/md5sum.log instead.
    • | connects the output of md5deep to the input of tee
    • tee will output its input and also write it to a file
    • --append tells tee to not overwrite file contents, but to append instead
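
The same idea can be wrapped in a small function. This is only a minimal sketch under a few assumptions: the name append_new_hashes is made up for illustration, GNU coreutils and findutils are available, and file names contain no newlines or backslashes. It swaps in md5sum for md5deep (both write "hash, two separator characters, path" lines for ordinary names), cuts the names out by column instead of using grep, and uses temporary files instead of <(...), so it should also run in shells without process substitution. It also skips the log file itself when the log lives inside the photo folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of md5deep): hash only the files under DIR
# that are not yet listed in LOG, then append the new lines to LOG.
# Usage: append_new_hashes /Family_Photo /Family_Photo/md5sum.log
append_new_hashes() (
    log=$2
    touch -- "$log" || exit 1              # an empty log means "nothing hashed yet"
    dir=$(realpath -- "$1") || exit 1      # canonical path, so names compare equal
    known=$(mktemp) && found=$(mktemp) || exit 1
    # File names start at column 35: 32 hex digits plus a two-character separator.
    cut -c35- -- "$log" | sort > "$known"
    find "$dir" -type f ! -samefile "$log" | sort > "$found"
    # comm -13 keeps lines unique to the second list, i.e. files not hashed yet.
    comm -13 "$known" "$found" |
        xargs --delimiter='\n' --no-run-if-empty md5sum -- >> "$log"
    status=$?
    rm -f -- "$known" "$found"
    exit "$status"
)
```

Running it a second time without adding files should leave the log unchanged, because every found name is already in the list.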
Jay
1

Thanks all for the input. After much struggling, I've come up with something that meets my current needs.

This part is run only once, the first time:

md5deep -r /Family_Photos/ > /Family_Photos/photos.md5
cd /Family_Photos/ && find . -print | sort > today.txt

The next part forms my script. First, prepare the txt files for each run:

cd /Family_Photos/ && rm -f old.txt && mv today.txt old.txt

List all files recursively into today.txt:

find . -print | sort > today.txt

Write the newly added files (lines in today.txt that are not in old.txt) to new.txt:

grep -xvFf old.txt today.txt > new.txt

Generate the md5sum of all new files and append it to photos.md5:

xargs -r -d '\n' md5sum < new.txt >> photos.md5
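
Put together, the steps above might look like the following. Treat it as a rough sketch rather than the exact script I run: the function name update_checksums is made up, it assumes GNU find, grep and xargs, and it deviates from the steps above in three ways. It lists only regular files (-type f), it leaves the bookkeeping files (old.txt, today.txt, new.txt, photos.md5) out of the listing, and it starts from an empty old list on the very first run, so the separate first-time step is not needed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the steps above. The body runs in a subshell,
# so the cd does not change the caller's working directory.
# Usage: update_checksums /Family_Photos
update_checksums() (
    cd -- "$1" || exit 1
    touch today.txt                 # first run: the "old" listing starts out empty
    mv -f today.txt old.txt
    # Fresh listing of regular files, without the bookkeeping files themselves.
    find . -type f ! -name old.txt ! -name today.txt ! -name new.txt \
        ! -name photos.md5 -print | sort > today.txt
    # Lines of today.txt that do not appear in old.txt are the new files.
    # grep exits with 1 when nothing is new; only other codes are real errors.
    grep -xvFf old.txt today.txt > new.txt || [ "$?" -eq 1 ] || exit 1
    # -r skips md5sum entirely when new.txt is empty.
    xargs -r -d '\n' md5sum < new.txt >> photos.md5
)
```

Note that the paths written to photos.md5 here are relative (./...), because find is run from inside the folder.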
Mich
-1

I would take an ls -l listing and store it in a temp file, then diff it against a fresh ls on a regular (say, daily) basis. If diff reports no differences, all is fine. If it does report differences, I would md5 only the files reported by diff and then update the temp file with the new listing. With diff's group and line format options (--LTYPE-line-format, %< and %>), the output can be limited to one side, so that removed files (present in the temp file but not in the fresh ls) are not reported.

This would be the pseudo-code for finding 'new' files:

# %> prints lines from the second input (the fresh listing), i.e. new entries
new_files=$(diff --changed-group-format='%>' --unchanged-group-format='' temp_file <(ls -l))

# %< prints lines from the first input (temp_file), so you can log deletions too
deleted_files=$(diff --changed-group-format='%<' --unchanged-group-format='' temp_file <(ls -l))

I leave the rest of the code (creating the first temp file and hashing the data) to you.

Obviously, if you have a directory tree, you have to use ls -R, then run the script from the root of the path you want to keep checked.
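
A slightly more concrete version of the idea, as a sketch only: the function name list_new_files is made up, it assumes GNU diff, and it swaps ls for a sorted find listing so that each line is exactly one file path. The snapshot file is assumed to live outside the folder being watched. All four group formats are spelled out so that only lines from the fresh listing are printed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print files that appeared since the last snapshot,
# then make the fresh listing the new snapshot.
# Usage: list_new_files /Family_Photo "$HOME/photo_snapshot.txt"
list_new_files() (
    dir=$1
    snapshot=$2
    status=0
    fresh=$(mktemp) || exit 1
    (cd -- "$dir" && find . -type f | sort) > "$fresh"
    touch -- "$snapshot"              # first run: compare against an empty listing
    # %> prints lines from the second (fresh) listing. Lines found only in
    # the old snapshot (deleted files) and unchanged lines print nothing.
    diff --old-group-format='' --unchanged-group-format='' \
         --new-group-format='%>' --changed-group-format='%>' \
         "$snapshot" "$fresh" || status=$?
    mv -- "$fresh" "$snapshot"        # the fresh listing becomes the next baseline
    [ "$status" -le 1 ]               # diff exits with 1 when the listings differ
)
```

The printed names are relative to the folder, so they could then be piped to something like xargs -r -d '\n' md5sum run from inside that folder.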

DDS
  • And that's not how you assign the output of a command to a variable (though not using variables would be much better). – tripleee Jun 03 '18 at 10:59
  • [Why *not* parse `ls`?](http://unix.stackexchange.com/questions/128985/why-not-parse-ls) – Cyrus Jun 03 '18 at 11:56
  • 1> I wrote pseudo-code (the corrections you posted are fine-grained details) 2> If it's for safety, it is enough to use ' ' around the ls output to tell bash "treat the output as a string without interpreting it" 3> I assume that, this being a home-use PC, no user is interested in injecting harmful code as a filename 4> It's not important how the output is displayed; what is important is that it's consistent between different ls executions 5> If you feel it is necessary, replace ls with for i in *; do echo "$i"; done – DDS Jun 03 '18 at 12:05