4

First of all, I'm a newbie with bash scripting, so forgive me if I'm making easy mistakes.

Here's my problem. I needed to download my company's website. I accomplished this using wget with no problems, but because some files have the ? symbol in their names and Windows doesn't allow ? in filenames, I had to create a script that renames those files and also updates the source code of every file that references the renamed file.

To accomplish this I use the following code:

find . -type f -name '*\?*' | while read -r file ; do
 SUBSTRING=$(echo $file | rev | cut -d/ -f1 | rev)
 NEWSTRING=$(echo $SUBSTRING | sed 's/?/-/g')
 mv "$file" "${file//\?/-}"
 grep -rl "$SUBSTRING" * | xargs sed -i '' "s/$SUBSTRING/$NEWSTRING/g"
done

This has two problems:

  1. It is taking way too long; I've waited more than 5 hours and it is still going.
  2. It looks like it is appending in the source code, because when I stop the script and search for changes, the URL is repeated 4 times (or more).

Thanks all for your comments, I will try the two separate steps and see. Also, just as an FYI, there are 3291 files that were downloaded with wget. Do you still think that bash scripting is preferable to other tools for this?

  • Are you sure it's actually running, not just waiting for input? – Biffen Oct 06 '16 at 17:25
  • Note that it is more likely that the '?' characters in some of your URLs introduce a query string. That would indicate that the underlying resource is probably dynamic, and might return different content at different times. – John Bollinger Oct 06 '16 at 17:27
  • You can do an incremental debug by first echoing the file found by the find command, then adding the other operations. – Inian Oct 06 '16 at 17:27
  • for each file you perform a `grep -rl "$SUBSTRING" * | xargs sed -i '' "s/$SUBSTRING/$NEWSTRING/g"` which processes all the files from the directory you're running it from. That takes a long time and is useless. – Jean-François Fabre Oct 06 '16 at 17:30
  • For each rename you perform, you read all of the downloaded files. I/O is expensive, so this is very wasteful. – John Bollinger Oct 06 '16 at 17:30
  • ok, I get it. But I see another issue: the `?` will be interpreted as "0 or 1" in your regexes. You have to escape them!! Another hindrance. I think some Python script would be more appropriate. Also, can you explain the `rev` command? – Jean-François Fabre Oct 06 '16 at 17:39
  • @JohnBollinger so do you think there's a better way, or should I try another approach like a Python or Ruby script? – Leobardo Mora Castro Oct 06 '16 at 17:45
  • @Jean-FrançoisFabre the rev command is there because I only need the part of the URL that doesn't contain the whole folder structure that wget made; I only need the name of the file itself to rename it in the source code. I can do Python if that is better; I just thought this script might need some tweaking to do better. – Leobardo Mora Castro Oct 06 '16 at 17:48
  • @Inian Do you have an example of that? Thanks. – Leobardo Mora Castro Oct 06 '16 at 17:50
  • @Jean-FrançoisFabre `?` is not a regex metacharacter in BRE, i.e. plain `grep`. It would make sense to use `grep -F` here, though. – tripleee Oct 06 '16 at 17:54
  • The `rev` is apparently a really wasteful way of doing `${line##*/}`. The `basename` command is also available if you desperately want to waste an external process, but if the task here is to optimize the script, using native Bash would seem like the way to go (see the sketch after these comments). – tripleee Oct 06 '16 at 17:57
  • @tripleee good point but in `sed` it _is_ a metacharacter. Scratch that: it needs escaping to work. So all good!!! – Jean-François Fabre Oct 06 '16 at 19:14
  • @Jean-FrançoisFabre: i have to agree with tripleee, `?` is not a BRE meta-character. If your `sed` supports it then either it is non-standard or you are using ERE. See http://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html#tag_09_03_06 – cdarke Oct 06 '16 at 19:18
  • @LeobardoMoraCastro, the fundamental problem is not the tools you are using but your algorithm. You could consider doing the job in two steps: (1) convert all file names as needed; (2) update contents of all files. This would probably involve dynamically creating a single sed or awk script during step 1 with which to perform the content edits on all files. You might even consider blindly editing *every* file instead of `grep`ing each one first to see whether it needs to be modified. The point is to avoid processing any file's contents more than once. – John Bollinger Oct 06 '16 at 19:20
  • @JohnBollinger that's telepathy. See my answer :) – Jean-François Fabre Oct 06 '16 at 19:25
  • @JohnBollinger thanks for the advice, but this poses a question: if I do the string replacement outside of the find/while loop, how will I know which file name I need to change in the source code, unless I save all the files that were renamed in an array or something? – Leobardo Mora Castro Oct 06 '16 at 19:33
  • @LeobardoMoraCastro, as I said: "This would probably involve dynamically creating a single sed or awk script during step 1 with which to perform the content edits on all files." Or yes, you could also store intermediate data in an array, scalar variable, or file and then build the needed script from that after exiting the loop. – John Bollinger Oct 06 '16 at 19:43
  • @LeobardoMoraCastro in your original script you process _all_ the files with _each_ expression. Why bother trying to process only the files which match? Better to process _all_ the files once with the replacement set. – Jean-François Fabre Oct 06 '16 at 20:14
  • You should use the switch `--no-run-if-empty` for `xargs`, otherwise the process might hang forever. – user1934428 Oct 07 '16 at 07:04
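
Putting tripleee's `${line##*/}` suggestion into the script's own variables, a minimal sketch (the sample filename below is made up, purely to show the expansions):

# hypothetical example filename, only to illustrate the expansions
file='./some/dir/index.html?page=2'
SUBSTRING=${file##*/}           # -> index.html?page=2  (same result as rev | cut -d/ -f1 | rev)
NEWSTRING=${SUBSTRING//\?/-}    # -> index.html-page=2  (no sed needed either)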
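
And the `--no-run-if-empty` suggestion, applied to the grep/xargs line from the question, might look roughly like this (GNU xargs assumed; note that GNU sed's `-i` takes no `''` argument, unlike the `sed -i ''` form in the question):

# sketch: if grep matches nothing, --no-run-if-empty (-r) makes xargs skip sed entirely
grep -rl "$SUBSTRING" * | xargs --no-run-if-empty sed -i "s/$SUBSTRING/$NEWSTRING/g"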

3 Answers

2

Seems odd that a file would have ? in its name. Website URLs use ? to indicate passing of parameters. wget from a website also doesn't guarantee you're getting the site's source, especially if server-side execution takes place, as with PHP files. So I suspect that as wget recurses, it finds URLs passing parameters and thus creates those files for you.

To really get the site, you should have direct access to the files.

If I were you, I'd start over and not use wget.

You may also be having issues with files or directories with spaces in their names.

As for that line with xargs: you're already handling one file at a time, but grepping through all files recursively. Just do the sed on the new file itself.
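
A rough sketch of that suggestion, reusing the variables from the question (the `newfile` name is introduced here just for illustration, and this only rewrites the renamed file itself):

# sketch: rename the file, then run sed on that one file only
newfile="${file//\?/-}"
mv "$file" "$newfile"
sed -i '' "s/$SUBSTRING/$NEWSTRING/g" "$newfile"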

strobelight
  • Hi, thanks for the answer, but having direct access to the files is not an option, and yes, this is a website in JSP that has server-side code execution, so that's why it is creating files with parameters. The problem with only using sed on the file I already have is that I need to update all the files from the site that reference the one I'm renaming, so that is why I have a recursive grep starting from the root again. – Leobardo Mora Castro Oct 06 '16 at 17:55
  • Although you present reasonable advice, it does not address the question(s) actually posed, which have to do with manipulating the names and contents of a collection of files already present on the machine (having been downloaded previously). – John Bollinger Oct 06 '16 at 19:15
1

Ok, here's the idea (untested):

  • in the first loop, just move the files and compose a global sed replacement file
  • once it is done, just scan all the files and apply sed with all the patterns at once, thus saving a lot of read/write operations which are likely to be the cause of the performance issue here
  • I would avoid putting the current script in the current directory or it will be processed by sed, so I assume that all files to be processed are not in the current dir but in a data directory

code:

sedfile=/tmp/tmp.sed
data=data
rm -f $sedfile
# locate ourselves in the subdir to preserve the naming logic
cd $data

# rename the files and compose the big sedfile

find . -type f -name '*\?*' | while read -r file ; do
 SUBSTRING=$(echo $file | rev | cut -d/ -f1 | rev)
 NEWSTRING=$(echo $SUBSTRING | sed 's/?/-/g')
 mv "$file" "${file//\?/-}"
 echo "s/$SUBSTRING/$NEWSTRING/g" >> $sedfile
done

# now apply the big sedfile once on all the files:    
# if you need to go recursive:
find . -type f  | xargs sed -i -f $sedfile
# if you don't:
sed -i -f $sedfile *
Jean-François Fabre
0

Instead of using grep, you can use the find command or ls command to list the files and then operate directly on them.

For example, you could do:

ls -1 /path/to/files/* | xargs sed -i '' "s/$SUBSTRING/$NEWSTRING/g"

Here's where I got the idea based on another question where grep took too long:

Linux - How to find files changed in last 12 hours without find command
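
If the downloaded paths can contain spaces, a `find`-based variant of the same idea might look like this (untested sketch; `-print0`/`-0` keep whitespace-safe filenames intact):

# sketch: let find enumerate the files and hand them to sed null-delimited
find /path/to/files -type f -print0 | xargs -0 sed -i '' "s/$SUBSTRING/$NEWSTRING/g"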

DomainsFeatured