
I have a very long list of files stored in a text file (missing-files.txt) that I want to locate on my drive. The files are scattered across different folders, and I want to find the closest available match for each one.

missing-files.txt

wp-content/uploads/2019/07/apple.jpg
wp-content/uploads/2019/08/apricots.jpg
wp-content/uploads/2019/10/avocado.jpg
wp-content/uploads/2020/04/banana.jpg
wp-content/uploads/2020/07/blackberries.jpg
wp-content/uploads/2020/08/blackcurrant.jpg
wp-content/uploads/2021/06/blueberries.jpg
wp-content/uploads/2021/01/breadfruit.jpg
wp-content/uploads/2021/02/cantaloupe.jpg
wp-content/uploads/2021/03/carambola.jpg
....

Here's my working bash code:

# Read each expected path, strip it down to the bare filename,
# then search for that filename and keep the first hit whose
# full path still matches the expected sub-path.
while IFS= read -r p; do
    file="${p##*/}"
    /usr/local/bin/fd "${file}" | /usr/local/bin/rg "${p}" | /usr/bin/head -n 1 >> collected-results.txt
done < missing-files.txt

What's happening in my bash code:

  1. I iterate over my list of files
  2. I use fd (https://github.com/sharkdp/fd) to locate each of those files on my drive
  3. I then pipe the results to ripgrep (https://github.com/BurntSushi/ripgrep) to filter them and find the closest match. The match I'm looking for should have the same file name and folder structure. I limit it to one result.
  4. Finally, I store the result in another text file, where I can later evaluate the list for the next step

Where I need help:

  1. Is this the most efficient way to do this? I have over 2,000 files that I need to locate. I'm open to other solutions; this is just something I devised.
  2. For some reason my code broke: it stopped returning results to "collected-results.txt". My guess is that it broke somewhere in the second pipe, right after the fd command. I haven't set up any handling for errors or for files that can't be found, so it's hard for me to determine (see the sketch after this list).
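
For reference, here is one way the loop above could be instrumented so that misses become visible. This is only a sketch: it assumes fd and rg are installed at the same paths, and SEARCH_ROOT is a name introduced here for wherever the files may live.

#!/usr/bin/env bash
# Sketch: same pipeline as above, plus per-file diagnostics.
SEARCH_ROOT="$HOME"   # hypothetical; point this at the folder to search

while IFS= read -r p; do
    file="${p##*/}"
    # --fixed-strings: treat the name and path as literal text, not regexes
    match="$(/usr/local/bin/fd --fixed-strings "$file" "$SEARCH_ROOT" \
        | /usr/local/bin/rg --fixed-strings "$p" | /usr/bin/head -n 1)"
    if [ -n "$match" ]; then
        printf '%s\n' "$match" >> collected-results.txt
    else
        printf 'NOT FOUND: %s\n' "$p" >&2   # log misses instead of dropping them silently
    fi
done < missing-files.txt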

Additional Information:

  • I'm using a Mac, running Catalina
  • Clearly this is not my area of expertise
Pennf0lio
  • `For some reason my code broke` - and? You want others guessing why it broke? – KamilCuk Jun 23 '21 at 10:00
  • Is ripgrep the correct tool? This generally searches for file content, and it seems you want to search for file names. – kvantour Jun 23 '21 at 10:16
  • @KamilCuk Yeah, but not just guessing - many of you here are smarter than I am. I was hoping you could catch my mistake. My solution was improvised from things I scoured around the web and put together. – Pennf0lio Jun 23 '21 at 15:26
  • @kvantour I'm not exactly sure if RIPGREP is the right tool - it just worked for a while in my case :/ – Pennf0lio Jun 23 '21 at 15:28
  • There are no obvious mistakes in the script - but I _guess_ `fd` depends on the current working directory, so if you're in a different directory... – KamilCuk Jun 23 '21 at 15:31
  • @KamilCuk I see, thanks! Actually, I'm running it on the root of my user folder so it should capture all available files. – Pennf0lio Jun 23 '21 at 16:02
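
For reference, fd also accepts an explicit search root as a second argument, which makes the command independent of the current working directory, e.g.:

/usr/local/bin/fd "apple.jpg" "$HOME"   # searches under the home folder regardless of where it's run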

3 Answers


"Missing" sounds like they do not exist where expected.
What makes you think they would be somewhere else?

If they are, I'd put the filenames in a list.txt file with just enough of a pattern to pick them out of the output of find.

$: cat list.txt
/apple.jpg$
/apricots.jpg$
/avocado.jpg$
/banana.jpg$
/blackberries.jpg$
/blackcurrant.jpg$
/blueberries.jpg$
/breadfruit.jpg$
/cantaloupe.jpg$
/carambola.jpg$

Then search the whole machine, which is gonna take a bit...

$: find / | grep -f list.txt
/tmp/apricots.jpg
/tmp/blackberries.jpg
/tmp/breadfruit.jpg
/tmp/carambola.jpg

Or if you want those longer partial paths,

$: find / | grep -f missing-files.txt

That should show you the actual paths to wherever those files exist IF they do exist on the system.

Paul Hodges
  • Thank you, Paul! As for the missing files, these are actually a list I've retrieved from my webserver. However, these missing images do exist in my local drive, they're just scattered in different places. What I'm trying to do is collect those missing files paths so I can decide to upload them on my web server. – Pennf0lio Jun 23 '21 at 14:58
  • This kinda works; however, it also returns multiple results if it finds multiple copies in different places. Example: /tmp/apricots.jpg, /tmp/blackberries.jpg, /documents/blackberries.jpg, /pictures/blackberries.jpg - it returns every "blackberries.jpg" it can find. I will only need 1 match. The sub-folder (e.g. wp-content/uploads/2019/07/) where it's stored is very important information, as it helps strengthen the match relevance. – Pennf0lio Jun 23 '21 at 15:17
  • Add it into the pattern file then. If you still get multiple hits, we can throw a unique sort on the end of the stream keyed to the filename (see the sketch after this thread). – Paul Hodges Jun 23 '21 at 17:41
  • This seems to not have worked on my end :/ 1.) I get an "Operation not permitted" error a lot, even using sudo (ref: https://share.getcloudapp.com/8Luo9X0n). 2.) I'm not entirely sure what "add it into the pattern" means - is this correct? https://share.getcloudapp.com/p9uANWQg - I've also tried to escape the slashes, and that also did not work. Any idea? Thank you! – Pennf0lio Jun 23 '21 at 19:43
  • Try just using `missing-files.txt` as the pattern file. Maybe add `$` to the end of each line as an anchor to speed things up and reduce misfires. By "add it into the pattern file" I meant include that path info, so this does much the same. – Paul Hodges Jun 23 '21 at 19:59
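
For reference, a sketch that combines the suggestions from this thread: anchor each pattern with $, search once, silence the permission errors, and keep only the first hit per filename. patterns.txt is a hypothetical scratch file introduced here.

# Append $ to every line so each pattern is anchored at end-of-path
sed 's/$/$/' missing-files.txt > patterns.txt

# One pass over the disk; 2>/dev/null hides "Operation not permitted" noise.
# The awk filter keeps only the first match per basename ($NF, with / as separator).
find / -type f 2>/dev/null | grep -f patterns.txt | awk -F/ '!seen[$NF]++' > collected-results.txt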

Is this the most efficient way to do this?

I/O is usually the biggest bottleneck. You are running fd once per file, one file at a time. Instead, run a single search that looks for all the files at once - one pass of I/O for everything. In shell you would do:

find . -type f '(' -name "first name" -o -name "other name" -o .... ')'
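
For example, with the first two names from missing-files.txt, that shape would be:

find . -type f '(' -name "apple.jpg" -o -name "apricots.jpg" ')'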

How can I iterate over a list of source files and locate those files on my disk drive?

Use -path to match the full path. First build the arguments, then call find.

findargs=()
# Read bashfaq/001
while IFS= read -r patt; do
    # I think */ should match anything in front.
    findargs+=(-o -path "*/$patt")
done < <(
    # TODO: escape glob better, not tested
    # see https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13
    sed 's/[?*[]/\\&/g' missing-files.txt
)
# remove the leading -o
unset 'findargs[0]'
find / -type f '(' "${findargs[@]}" ')'

Topics to research: `var=()` - bash arrays; `< <(...)` - shell redirection with process substitution and when to use it (bashfaq/024); glob (see `man 7 glob`); and `man find`.
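
As a toy illustration of the array and process-substitution constructs used above (the names here are made up):

# Build an array from lines produced by a process substitution
args=()
while IFS= read -r line; do
    args+=(-o -path "*/$line")
done < <(printf '%s\n' one two three)
unset 'args[0]'              # drop the leading -o
printf '%s\n' "${args[@]}"   # prints each remaining element on its own line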

KamilCuk
  • Thank you @KamilCuk, I'm still trying to make your code work on my end. Does "findargs" need to be a function? To my understanding, it will iterate over my "missing-files.txt" and include a regex pattern. I could actually just modify the text file and include the regex pattern - will that work instead? Where I'm having a problem is that I can't make the snippet work. For your reference: https://share.getcloudapp.com/p9uANB2Q – Pennf0lio Jun 23 '21 at 15:47
  • `Does "findargs" need to be a function?` It's a bash array. `modify my "missing-files.txt"` The script does not modify anything; it only outputs. `Will that work instead?` Something would have to read and act on that regex - that depends on that something. – KamilCuk Jun 23 '21 at 16:07

From the way I understand it, you want to find all files that could match the directory structure:

path/to/file

So it should return something like "/full/path/to/file" and "/another/full/path/to/file".

Using a single find command you can get a list of all files that match these criteria, searching your whole hard disk in one go with something of the form:

$ find . -regex pattern

The idea is now to build pattern, which we can do from the file missing-files.txt. The pattern should look something like .*/\(file1\|file2\|...\|filen\). So we can use the following sed to do so:

$ sed ':a;N;$!ba;s/\n/\|/g' missing-files.txt

So now we can do exactly what you did, but a bit quicker, in the following way:

pattern="$(sed ':a;N;$!ba;s/\n/\|/g' missing-files.txt)"
pattern=".*/\($pattern\)"
find . -regex "$pattern" > file_list.txt

In order to find the files, you can now do something like:

grep -F -f missing-files.txt file_list.txt

This will return all the matching cases. If you just want the first match per pattern, you can do:

awk '(NR==FNR){a[$0]++;next}{for(i in a) if (!(i in b)) if ($0 ~ i) {print; b[i]}}' missing-files.txt file_list.txt
kvantour
  • Thank you @Kvantour! I tried your approach, but I can't seem to make it work as intended. 1.) The first sed pattern doesn't seem to transform the list; it does not add the pipe separator. 2.) Do we also need to escape the forward slashes (e.g. wp-content\/uploads\/2020\/)? 3.) There might be multiple copies per file, but I will only need one, thus the path (e.g. wp-content/uploads/2020) is very important to strengthen the relevance of the results. – Pennf0lio Jun 23 '21 at 17:52
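
For what it's worth, the :a;N;$!ba loop is a GNU sed idiom, and the stock BSD sed on macOS parses labels differently - which would explain the list not being transformed. A possible BSD-friendly variant of the join-and-search steps (a sketch; note that find -E is BSD/macOS syntax, GNU find would use -regextype posix-extended instead):

# Join the lines with | (works with both BSD and GNU paste)
pattern="$(paste -s -d'|' missing-files.txt)"
# -E: extended regexes, so plain | is alternation
find -E . -regex ".*/($pattern)" > file_list.txt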