How to create md5sum for new files

Question

We've created a folder on my dad's computer for everyone in the family to deposit and share their photos and videos.

Example of directories:
/Family_Photo/Penguins/2017 09 02/
/Family_Photo/East Beach/2017 10 11/Seaside/
/Family_Photo/East Beach/2017 10 11/Games/

Using md5deep, I am able to create a complete list of checksums for all the files in all subdirectories:

md5deep -r /Family_Photo/ > /Family_Photo/md5sum.log

Instead of regenerating the complete md5 checksum list for all files (newly added and existing) every time, how can I create a bash script that automatically detects any files that have not been hashed before, generates checksums for just these new files, and appends them to the original md5sum.log?

I don't know if md5deep can do that. If not, you could use `find` to filter files based on modification date, then md5 those and somehow patch the main md5 list. If you really want to append only previously nonexistent files (and ignore changes), then you'd need to filter based on the names in the existing list. – Jay, Jun 03 '18 at 10:10
Traversing the directory tree to find the files is the expensive part of the operation; recalculating the checksum for files you already have a checksum for is negligible by comparison, and thus an unnecessary optimization. – tripleee, Jun 03 '18 at 11:01
I'm intrigued as to *why* anyone would want to md5 their photos? – Mark Setchell, Jun 03 '18 at 11:55

Jay · Accepted Answer · 2018-07-05T23:13:47.867

Solution

This should do the trick:

comm -1 -3 <(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) <(find /Family_Photo -type f | sort) | xargs --delimiter='\n' --no-run-if-empty md5deep | tee --append /Family_Photo/md5sum.log

Notes

If you use a different path than the one in the example, make sure it is absolute and canonical, or append the option -exec realpath {} \; to find. md5deep seems to write such paths into the file, and the two lists must match exactly for the comparison to work.
This command line uses bash-specific syntax (process substitution, which passes a command's output as a file) and might not work in other shell interpreters.
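For shells without process substitution, the same idea can be sketched with temporary files. This is not part of the original answer: the function name `hash_new_files` and the `HASHER` variable are made up for illustration, sed stands in for the grep look-behind, and the sketch assumes GNU xargs, `realpath`, and file names without newlines.

```shell
#!/bin/sh
# Sketch: hash only the files that are not yet listed in the log.
# hash_new_files and HASHER are illustrative names, not md5deep features.
hash_new_files() {
    dir=$(realpath "$1") || return 1    # canonical absolute path, matching what gets logged
    log=$2
    known=$(mktemp) && found=$(mktemp) || return 1
    touch "$log"                        # first run: start from an empty log
    # Already-hashed names: strip the hash and the two spaces that follow it.
    sed 's/^[^ ]*  //' "$log" | LC_ALL=C sort > "$known"
    # Every regular file currently under the directory, except the log itself.
    find "$dir" -type f ! -path "$(realpath "$log")" | LC_ALL=C sort > "$found"
    # Lines that appear only in $found are new files; hash them and append.
    LC_ALL=C comm -13 "$known" "$found" |
        xargs -d '\n' -r "${HASHER:-md5deep}" >> "$log"
    rm -f "$known" "$found"
}
```

Calling `hash_new_files /Family_Photo /Family_Photo/md5sum.log` would then append hashes for new files only. Setting `HASHER=md5sum` swaps in md5sum, which writes the same "hash, two spaces, path" layout.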

Explanation

comm -1 -3
- We use this command in this specific case to see which files are new by comparing found files to the existing list.
- comm compares two sorted lists and outputs which lines are unique to each and which are common to both
- -1 means: don't show lines unique to first list
- -3 means: don't show lines common to both files
- as a result we only output lines unique to second list
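To see the effect of these two flags in isolation, here is a small sketch with two made-up sorted lists (the paths are placeholders, not real files):

```shell
# Two small sorted lists standing in for "already hashed" and "found on disk".
hashed=$(mktemp)
found=$(mktemp)
printf '%s\n' /photos/a.jpg /photos/b.jpg > "$hashed"
printf '%s\n' /photos/a.jpg /photos/b.jpg /photos/c.jpg > "$found"

# -1 drops lines unique to the first list, -3 drops lines common to both,
# leaving only the lines unique to the second list.
new_files=$(comm -1 -3 "$hashed" "$found")
echo "$new_files"    # prints /photos/c.jpg
rm -f "$hashed" "$found"
```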
<(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) As first file to comm we pass a list of the already hashed filenames.
- <(...) is bash syntax to pass the result of a program as file argument
- With grep we extract the file names from the existing file by matching whatever follows double-space
- --text makes sure md5sum.log is always considered a text file and not skipped
- --perl-regex use perl regular expression syntax (we need this for look-behind matching)
- --only-matching only output text that matched the pattern, not the entire line with the match
- '(?<=  ).+' the matching pattern: (?<=  ) is a "look-behind" pattern that checks whether the match is preceded by two spaces; it is followed by .+ (any characters, one or more)
- | sort we pass the output of grep to sort, because comm expects sorted lists
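The extraction step can be tried on a single made-up log line. This sketch assumes GNU grep built with PCRE support; the hash and the file name are placeholders.

```shell
# A sample log line in the "hash, two spaces, path" layout.
line='d41d8cd98f00b204e9800998ecf8427e  /Family_Photo/East Beach/2017 10 11/Seaside/img 1.jpg'

# The look-behind requires two spaces before the match; .+ then takes the rest
# of the line, so single spaces inside the path are kept.
name=$(printf '%s\n' "$line" | grep --text --perl-regexp --only-matching '(?<=  ).+')
echo "$name"
```

The hash itself contains no spaces, so the first place the look-behind can succeed is right after the separator, which is where the path starts.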
<(find /Family_Photo -type f | sort) As second file to comm we pass all files we find
- <(...) is bash syntax to pass the result of a program as file
- find will recurse a given directory and print out all file names
- -type f instructs find to only output the names of regular files, not directories
- | sort we pass the output of find to sort, because comm expects sorted lists
| xargs --delimiter='\n' --no-run-if-empty md5deep The resulting list of new files is passed to md5deep
- | connects the output of comm to the input of xargs
- xargs will call a command (in this case md5deep) with whatever comes as input as argument
- --delimiter='\n' specifies a newline as separator, so that other whitespace in file names won't be mistaken for the start of a new argument
- --no-run-if-empty we don't want to run md5deep if we don't have a single new filename to pass to it.
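A quick way to check both options is to count the arguments xargs hands over. This is only an illustration with a made-up file name, and it assumes GNU xargs.

```shell
# A name with spaces, as in the example directories.
name='East Beach/2017 10 11/img 1.jpg'

# Default splitting treats every blank as an argument boundary.
split_on_whitespace=$(printf '%s\n' "$name" | xargs sh -c 'echo $#' _)
# With a newline delimiter the whole line stays one argument.
split_on_newline=$(printf '%s\n' "$name" | xargs --delimiter='\n' sh -c 'echo $#' _)
echo "$split_on_whitespace $split_on_newline"    # prints 5 1

# With empty input, --no-run-if-empty keeps the command from running at all.
empty_run=$(printf '' | xargs --no-run-if-empty echo ran)
echo "[$empty_run]"                              # prints []
```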
| tee --append /Family_Photo/md5sum.log The resulting list of hashes is written to the hash file
- This displays the new files/hashes for your convenience while writing them, if you don't want to see them, just use >> /Family_Photo/md5sum.log instead.
- | connects the output of md5deep to the input of tee
- tee will output its input and also write it to a file
- --append tells tee to not overwrite file contents, but to append instead
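The append behaviour can be checked on a scratch file. The temporary file here is just a stand-in for md5sum.log, and the long option assumes GNU coreutils.

```shell
log=$(mktemp)
printf 'first\n' > "$log"

# tee prints its input and, with --append, adds it to the end of the file
# instead of replacing what is already there.
shown=$(printf 'second\n' | tee --append "$log")
echo "$shown"    # prints second
cat "$log"       # prints first, then second
```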

score 1 · Answer 2 · answered Jul 05 '18 at 08:21

Thanks all for the input. After much struggling, I've come up with something that meets my current needs.

This part is run only the first time:

md5deep -r /Family_Photo/ > /Family_Photo/photos.md5
cd /Family_Photo/ && find . -type f -print | sort > today.txt

The next part will form my script. It prepares the txt files for every run:

cd /Family_Photo/ && rm -f old.txt && mv today.txt old.txt

List all files recursively into today.txt:

find . -type f -print | sort > today.txt

Write the newly added files into new.txt:

grep -xvFf old.txt today.txt > new.txt

Generate the md5sum of all new files and append it to photos.md5:

xargs -d '\n' --no-run-if-empty md5sum < new.txt >> photos.md5
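Put together, the steps above could look roughly like the sketch below. The function name `update_checksums` is made up, and keeping the listings and the checksum file in a separate state directory is an assumption of this sketch (it stops them from appearing in the listing themselves); the answer keeps them inside the photo folder. GNU grep and xargs are assumed.

```shell
#!/bin/sh
# Sketch of the listing-based approach as one function.
# update_checksums and the state directory layout are illustrative.
update_checksums() {
    photos=$1
    state=$2
    mkdir -p "$state"
    touch "$state/today.txt" "$state/photos.md5"
    # The previous listing becomes old.txt (empty on the very first run).
    mv -f "$state/today.txt" "$state/old.txt"
    (cd "$photos" && find . -type f -print | sort) > "$state/today.txt"
    # Lines of today.txt that match no whole line of old.txt are the new files.
    grep -xvFf "$state/old.txt" "$state/today.txt" > "$state/new.txt"
    # Hash only the new files; -r skips md5sum entirely when nothing is new.
    (cd "$photos" && xargs -d '\n' -r md5sum) < "$state/new.txt" >> "$state/photos.md5"
}
```

On the first call old.txt is empty, so every file counts as new; later calls only pick up names that were not in the previous listing.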

DDS · Answer 3 · 2018-06-03T10:51:01.610

-1

I'd take an ls -l listing and store it in a temp file,
then diff it against a fresh ls on a daily (?) basis. If diff returns 0, nothing has changed; otherwise it shows the differences.
Then I'd md5 only the files reported by diff and update the temp file with the new listing. I'd use diff's group-format options so that only lines from the fresh listing are printed; that way removed files (present in the temp file but not in the fresh ls) are ignored.

This is the pseudo-code for finding 'new' files:

new_files=diff --suppress-common-lines --changed-group-format='%>' --unchanged-group-format='' temp_file $(ls -l)

deleted_files=diff --suppress-common-lines --changed-group-format='%<' --unchanged-group-format='' temp_file $(ls -l) #so you can log deletions too

I leave the rest of the code (creating the first temp file and hashing the files) to you.

Obviously, if the directory has subdirectories you have to use ls -R, and run the script from the root of the path you want to keep checked.
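A runnable take on the same idea might look like the following. It is a sketch with one deliberate swap: the listing comes from find rather than ls, so each line is a plain path. The function name `list_changes` is made up, and the group-format options assume GNU diff.

```shell
#!/bin/sh
# Sketch: compare a stored listing with a fresh one and report the difference.
# list_changes is an illustrative name; results land in new_files and deleted_files.
list_changes() {
    dir=$1
    listing=$2                      # stored listing from the previous run
    fresh=$(mktemp) || return 1
    touch "$listing"                # first run: compare against an empty listing
    (cd "$dir" && find . -type f | LC_ALL=C sort) > "$fresh"
    # %> keeps lines that appear only in the fresh listing (new files),
    # %< keeps lines that appear only in the stored listing (deleted files).
    new_files=$(diff --changed-group-format='%>' --unchanged-group-format='' "$listing" "$fresh")
    deleted_files=$(diff --changed-group-format='%<' --unchanged-group-format='' "$listing" "$fresh")
    mv -f "$fresh" "$listing"       # the fresh listing is the baseline for the next run
}
```

After a call, `$new_files` holds one path per line and could be fed to md5sum, while `$deleted_files` can be logged.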

edited Jun 03 '18 at 10:51

answered Jun 03 '18 at 10:36

And that's not how you assign the output of a command to a variable (though not using variables would be much better). – tripleee Jun 03 '18 at 10:59
[Why *not* parse `ls`?](http://unix.stackexchange.com/questions/128985/why-not-parse-ls) – Cyrus Jun 03 '18 at 11:56
1> I wrote pseudo-code (the corrections you posted are fine-grained details). 2> If it's for safety, it's enough to put quotes around the ls output to tell bash "treat the output as a string without interpreting it". 3> I assume that on a home-use PC no user is interested in injecting harmful code as a file name. 4> It's not important how the output is displayed; what matters is that it's consistent between different ls executions. 5> If you feel it's necessary, replace ls with for i in *; do echo "$i"; done – DDS Jun 03 '18 at 12:05
