Modify gupdatedb (GNU updatedb command) to insert parallel command

Question

I am working on macOS 10.15 with the tools glocate and gupdatedb from the findutils package, installed with brew.

I would like to integrate the shell command "parallel" into the gupdatedb script in order to build the database faster.

In the original version of the gupdatedb script, I have:

: ${find:=${BINDIR}/gfind}

1) I tried to insert the parallel command into the line above.

Usually, with gfind, we can use the parallel command like this:

parallel --lb -j32 gfind ::: /*

Here the shell expands '/*' to every top-level entry of the root directory, and each gfind job lists one of them together with all its subdirectories.

So, for the gupdatedb script, I tried:

: ${find:=/usr/local/bin/parallel -j32 ${BINDIR}/gfind}

But on execution I get the following error, which I can't explain:

updatedb needs to be able to execute -j32, but cannot.

2) I also tried going through a variable:

    num_threads=-j32
    ${parallel:=${BINDIR}/parallel --lb $num_threads}
    : ${find:=${parallel} ${BINDIR}/gfind \{\} ::: }
    : ${frcode:=${LIBEXECDIR}/gfrcode}

But the script hangs and the database is not generated.

How can I overcome this issue so that gfind runs on multiple threads (here 8 threads)?

PS1: this post refers to another question, "parallel with find", which explains how to combine the find and parallel commands.

PS2: the gupdatedb script is relatively long, so I give below what I think are the relevant sections (I stopped the hanging program with Ctrl+C):

# The database file to build.
: ${LOCATE_DB=/usr/local/var/locate/locatedb}

# Directory to hold intermediate files.
if test -z "$TMPDIR"; then
  if test -d /var/tmp; then
    : ${TMPDIR=/var/tmp}
  elif test -d /usr/tmp; then
    : ${TMPDIR=/usr/tmp}
  else
    : ${TMPDIR=/tmp}
  fi
fi
export TMPDIR

# The user to search network directories as.
: ${NETUSER=daemon}

# The directory containing the subprograms.
if test -n "$LIBEXECDIR" ; then
    : LIBEXECDIR already set, do nothing
else
    : ${LIBEXECDIR=/usr/local/Cellar/findutils/4.7.0/libexec}
fi

# The directory containing find.
if test -n "$BINDIR" ; then
    : BINDIR already set, do nothing
else
    : ${BINDIR=/usr/local/bin}
fi

# DEV : parallel prefix command
num_threads=-j32
${parallel:=${BINDIR}/parallel --lb $num_threads}
# The names of the utilities to run to build the database.
: ${find:=${parallel} ${BINDIR}/gfind \{\} ::: }
: ${frcode:=${LIBEXECDIR}/gfrcode}

UPDATE 1: If I comment out the line checkbinary $binary and apply my second method (see 2) above), I get the following output (I enabled set -x for debugging):

+ version='
updatedb (GNU findutils) 4.7.0
Copyright (C) 1994-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
'
+ LC_ALL=C
+ export LC_ALL
+ usage='Usage: /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb [--findoptions='\''-option1 -option2...'\'']
       [--localpaths='\''dir1 dir2...'\''] [--netpaths='\''dir1 dir2...'\'']
       [--prunepaths='\''dir1 dir2...'\''] [--prunefs='\''fs1 fs2...'\'']
       [--output=dbfile] [--netuser=user] [--localuser=user]
       [--dbformat] [--version] [--help]

Please see also the documentation at http://www.gnu.org/software/findutils/.
Report (and track progress on fixing) bugs in the updatedb
program via the GNU findutils bug-reporting page at
https://savannah.gnu.org/bugs/?group=findutils or, if
you have no web access, by sending email to <bug-findutils@gnu.org>.
'
+ changeto=/
+ frcode_options=
+ case "$dbformat" in
+ true
+ sort='/usr/bin/sort -z'
+ print_option=-print0
+ frcode_options=' -0'
+ :
+ : /usr/local/bin/zsh
+ : /
+ :
+ : '
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
'
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ test -z ''
++ echo /afs /amd /proc /sfs /tmp /usr/tmp /var/tmp
++ sed -e 's,^,\\(^,' -e 's, ,$\\)\\|\\(^,g' -e 's,$,$\\),'
+ PRUNEREGEX='\(^/afs$\)\|\(^/amd$\)\|\(^/proc$\)\|\(^/sfs$\)\|\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)'
+ : /usr/local/var/locate/locatedb
+ test -z ''
+ test -d /var/tmp
+ : /var/tmp
+ export TMPDIR
+ : daemon
+ test -n ''
+ : /usr/local/Cellar/findutils/4.7.0/libexec
+ test -n ''
+ : /usr/local/bin
+ num_threads=-j32
+ /usr/local/bin/parallel --lb -j32
Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

  Tange, O. (2020, July 22). GNU Parallel 20200722 ('Privacy Shield').
  Zenodo. https://doi.org/10.5281/zenodo.3956817

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#Citation-notice

To silence this citation notice: run 'parallel --citation' once.

Come on: You have run parallel 15 times. Isn't it about time
you run 'parallel --citation' once to silence the citation notice?

parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.
^C+ : /usr/local/bin/parallel --lb -j32 /usr/local/bin/gfind '{}' :::
+ : /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode
+ : '
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
'
+ test -n '
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
'
++ echo 9P NFS afs autofs cifs coda devfs devpts ftpfs iso9660 mfs ncpfs nfs nfs4 proc shfs smbfs sysfs
++ sed -e 's/\([^ ][^ ]*\)/-o -fstype \1/g' -e 's/-o //' -e 's/$/ -o/'
+ prunefs_exp='-fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o'
+ rm -f /usr/local/var/locate/locatedb.n
+ trap 'rm -f $LOCATE_DB.n; exit' HUP TERM
+ cd /
+ test -n /
+ '[' '' '!=' '' ']'
+ /usr/bin/sort -z
+ /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode -0
+ : OK so far
+ true
+ test -s /usr/local/var/locate/locatedb.n
+ chmod 644 /usr/local/var/locate/locatedb.n
+ mv /usr/local/var/locate/locatedb.n /usr/local/var/locate/locatedb
+ exit 0
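
The trace above shows /usr/local/bin/parallel --lb -j32 being executed on its own, just before the "Input is read from the terminal" warning. A likely cause, offered as a sketch rather than a confirmed diagnosis, is that the ${parallel:=...} line lacks the leading ":" that the neighbouring lines have, so the shell runs the expanded value as a command and parallel waits for input. The example below uses echo as a stand-in for parallel, and the names tool1 and tool2 are purely illustrative.

```shell
#!/bin/sh
# Illustrative names only; echo stands in for /usr/local/bin/parallel.
unset tool1 tool2

# No leading ':' -> the expansion itself becomes the command line, so
# "echo started -j32" is executed (here inside a command substitution
# so that its output can be captured).
out1=$(${tool1:=echo started -j32})

# Leading ':' -> the null command ignores its arguments, so the default
# is assigned to tool2 and nothing is run.
: ${tool2:=echo started -j32}

echo "without colon, the command ran and printed: $out1"
echo "with colon, tool2 only holds the text: $tool2"
```

With the ":" in place the variable is just set, which is what the other assignments in the script rely on. This assumes an sh or bash style shell, where an unquoted expansion is split into words.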

UPDATE 2:

@MarkSetchell. I simply run sudo gupdatedb in a directory.

Could you please give the full command to apply? You suggested parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS but this doesn't seem to work.

What I have tried is: parallel -j32 --lb find {} $FINDOPTIONS * ::: */* but after a while, I get the following error: gfind: failed to read file names from file system at or below '/': No such file or directory

I would like to index all files under the root directory /, but the contents of / and /System/Volumes/Data/ are duplicated.

UPDATE 3: If the number of subdirectories is lower than the number of jobs I request with parallel -j32 ..., is there a way to tell the parallel command to descend into the sub-subdirectories, and deeper still, to find more work?

It seems that make -j32 behaves this way (maybe I am wrong). It would be very useful not to have a single process stuck on one subdirectory when that subdirectory contains a large number of sub-subdirectories, so that all 32 processes launched by parallel -j32 ... are put to use. This would avoid losing time on sub-subdirectories, or even deeper levels, that are not parallelized.
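
One possible way to get more work units than there are top-level directories, sketched here under assumptions rather than tested on a real system, is to split one level deeper: list the top two levels directly, then start one find job per second-level directory. The names FIND, ROOT and list_tree are hypothetical, the demo tree is created in a temporary directory, and the sketch falls back to a sequential run when GNU parallel is not installed.

```shell
#!/bin/sh
# Sketch: split a tree into work units two levels down instead of one.
# FIND may be set to gfind where it is installed; ROOT is a throwaway demo tree.
FIND=${FIND:-find}
ROOT=$(mktemp -d)
mkdir -p "$ROOT/a/x" "$ROOT/a/y" "$ROOT/b/z"
touch "$ROOT/top.txt" "$ROOT/a/f1" "$ROOT/a/x/f2" "$ROOT/a/y/f3" "$ROOT/b/z/f4"

list_tree() {
    # Entries at depth 0 and 1, plus non-directories at depth 2
    "$FIND" "$1" -maxdepth 1
    "$FIND" "$1" -mindepth 2 -maxdepth 2 ! -type d
    # Each depth-2 directory becomes its own job
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        "$FIND" "$1" -mindepth 2 -maxdepth 2 -type d -print0 |
            parallel -0 -j 32 "$FIND" {}
    else
        # Sequential fallback so the sketch still runs without GNU parallel
        "$FIND" "$1" -mindepth 2 -maxdepth 2 -type d -exec "$FIND" {} \;
    fi
}

# The split listing should contain exactly the entries a plain find reports
split_listing=$(list_tree "$ROOT" | sort)
plain_listing=$("$FIND" "$ROOT" | sort)
[ "$split_listing" = "$plain_listing" ] && echo "same file list"
rm -rf "$ROOT"
```

Splitting deeper only makes the units smaller. A single second-level directory holding most of the files would still keep one process busy long after the others finish.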

UPDATE 4: I don't know how to fill in the command suggested by @MarkSetchell. For example, if I have 3 subdirectories in the current directory:

# : A2
parallel -j 32 --lb  gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS

In particular, what should I put for BUNCH_OF_PATHS?

Do I have to pass the option --localpaths dir1/ dir2/ dir3/ instead of BUNCH_OF_PATHS? And what about $FINDOPTIONS ... with the three dots?

This doesn't make sense. You normally only update the database in the background once a week, or once a day at night-time when the system is quiet, so it doesn't matter whether it takes 2 minutes or 15 minutes. There is no point using `--lb` because that is for interactive use and it makes the individual results come out sooner but it makes the job slower as a whole. — Mark Setchell, Aug 05 '20 at 14:23
Also, if you divide up the work by directory, parallelisation will not work very well unless the directories have roughly equal numbers of files, so even if you start 32 processes, you will find `/etc` and `/var` and a load of small directories will get done in 2 seconds each and 31 of your processes will finish and one single process will be left analysing the 10,000,000 files under `/Users/YOU` for the next 2 hours. — Mark Setchell, Aug 05 '20 at 14:23
In answer to your latest questions, you need to run `man gupdatedb` and read how it works. You will see it doesn't matter which directory you start it in, but you can pass `--localpaths` to choose some directories and you will also see how you can exclude certain paths, such as `/Volumes`, with `--prunepaths`. — Mark Setchell, Aug 05 '20 at 14:27
I understand the method does not always make sense, but could you please give me, with the `--localpaths` (which could be `/` or `/Users/me`) and `--prunepaths` options, the command line concerned, which is by default: `$find $SEARCHPATHS $FINDOPTIONS \( $prunefs_exp -type d -regex "$PRUNEREGEX" \) -prune -o $print_option`? — , Aug 05 '20 at 15:04
@MarkSetchell Have you been able to see my **UPDATE 4**, where I am stuck on the syntax of `parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS`? In particular I don't know what to set for `$FINDOPTIONS` and `BUNCH_OF_PATHS`. I am a little lost; a simple example would be great. — , Aug 13 '20 at 02:22

Mark Setchell · Accepted Answer · 2020-08-04T14:36:40.913

Updated Answer

The problem is on the line after the line containing A2 in the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb. Currently, it is of the form:

# : A2
$find $SEARCHPATHS $FINDOPTIONS \( $prunefs_exp  -type d -regex "$PRUNEREGEX" \) -prune -o $print_option

whereas you want it to be of the form:

# : A2
parallel -j 32 --lb  gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS

As you haven't given the paths you wish to search in parallel, the only path at the moment is /, which means nothing can be done in parallel. You will need to run with --localpaths set to a bunch of places that are worth searching in parallel, or hack the script even more extensively. Though, to be honest, I am not sure why you would want to speed this up, because it should only be run relatively rarely, and then only at times when the system is quiet.
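
As a hedged illustration of the form above, the sketch below fills in the placeholders with made-up values. The directory list in SEARCHPATHS is an assumption (pick top-level directories that exist and are worth searching on your machine), FINDOPTIONS is left empty, and the pruning expression from the original line is omitted for brevity. It assumes that --localpaths is what fills SEARCHPATHS in the script. By default it only prints the command; it runs it only if RUN=1 is set and both parallel and gfind are installed.

```shell
#!/bin/sh
# Sketch only: SEARCHPATHS is an illustrative list, not taken from the script.
SEARCHPATHS="/Applications /Library /Users /usr"
FINDOPTIONS=""          # extra find options, empty in this example
print_option="-print0"  # the value the trace shows for the default db format

# One gfind job per search path; parallel replaces {} with each path after :::
cmd="parallel -j 32 gfind {} $FINDOPTIONS $print_option ::: $SEARCHPATHS"
echo "$cmd"

# Execute only when explicitly requested and both tools are available
if [ "${RUN:-0}" = 1 ] && command -v parallel >/dev/null 2>&1 \
        && command -v gfind >/dev/null 2>&1; then
    # Unquoted on purpose so the string is split back into words
    $cmd | tr '\0' '\n' | head
fi
```

The --lb option is left out here because it is aimed at interactive use. In the script itself, the output of this command would still have to be piped through the existing sort and gfrcode stages.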

Original Answer

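Go to around line 250 of the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb and comment out the checkbinary call with a hash sign. Keep a ":" no-op in the loop body, because sh and bash reject a do ... done block that contains no command. It should look like this:

for binary in $find $frcode
do
  : #checkbinary $binary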
done
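
Why the check complains about -j32 can be sketched as follows. The loop above iterates over the unquoted $find, so a value such as "/usr/local/bin/parallel -j32 ${BINDIR}/gfind" is split into separate words and each word is tested as if it were a program. The function check_word is a hypothetical stand-in for checkbinary (its real body is not shown in the question), and /bin/sh and /bin/ls stand in for parallel and gfind.

```shell
#!/bin/sh
# check_word is a hypothetical stand-in for the script's checkbinary.
check_word() {
    if test -x "$1"; then
        echo "ok: $1"
    else
        echo "cannot execute: $1"
    fi
}

# /bin/sh and /bin/ls stand in for parallel and gfind
find="/bin/sh -j32 /bin/ls"

# Unquoted $find is split on spaces, so -j32 is checked as if it were a program
report=$(for binary in $find; do check_word "$binary"; done)
echo "$report"
```

Run under sh or bash, the two paths pass the check and the middle word -j32 is reported as not executable, which matches the shape of the error in the question.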

edited Aug 04 '20 at 14:36

answered Aug 04 '20 at 13:06

Mark Setchell

Thanks for your quick answer. Unfortunately, I have already done that and this doesn't work. See my **UPDATE1** for more details – Aug 04 '20 at 13:15
Huh? The word `UPDATE1` doesn't appear in your question. Nor does the word `checkbinary`. – Mark Setchell Aug 04 '20 at 13:19
Sorry, I was busy, I have just put the **UPDATE 1**. – Aug 04 '20 at 13:34
What is the exact command, with all the options you specify, that you are using to run `gupdatedb` please? – Mark Setchell Aug 04 '20 at 13:56
@MarkSetchell. I simply run `sudo gupdatedb` in a directory. Could you please give the full command to apply? You suggested `parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS` but this doesn't seem to work. What I have tried is: `parallel -j32 --lb find {} $FINDOPTIONS * ::: */*` but after a while, I get the following error: `gfind: failed to read file names from file system at or below '/': No such file or directory`. What can I do now? – Aug 05 '20 at 13:52
Hi! Did you see my last **UPDATE4**? I don't know where to set the correct options (the "classical" options of the `find` command and the other options for what you call `BUNCH_OF_PATHS`) that produce your command line `parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS`. Could you be more explicit and practical about these different options? A simple example would be fine! Regards – Aug 09 '20 at 21:46
