
Right now I have something like this. This function is part of a Bash script file. Inside this function I call many custom functions, nothing too complicated. For example, lenght just checks the file name against string rules. Every function that I add makes the script much slower. Tested on 300 files: a simple find that just echoes the file name takes less than a second; with all the functions it takes 0h:0m:11s. I know there is not enough info, but still, how can I make this faster?

In production I have to loop over 20 million files.

function initDatabase {

    dir="$@"
    # check dir is not empty
    if [ ! -z $dir ]
    then
        find $dir -type f -print0 | while IFS= read -r -d '' FILE
        do
            error=0
            out=''

            #FUNCTION  validates file name
            out=$(lenght)

            if [ ! -z "$out" ]
            then 

                echo -e "${NC}${BLUE}Fail on vigane"
                echo -e "${RED}$out${NC}"
                echo "erro" >> $LOG_FILE_NAME
                echo "$out" >> $LOG_FILE_NAME
                error=1
            fi


            if [ $error == 0 ]
            then
                #get file name and directory
                f=${FILE##*/}
                f_dir="${FILE%/*}"
                changed=$(stat -c%Y $FILE)

                ## checks if file is pyramid tiff
                pyramid="false"
                out="$(multi $FILE)"

                if [ "$out" == 1 ]; then pyramid="true"; fi
                #FUNCTION removes zeros from beginning
                prop2=$(removeZeros "$(echo $f | cut -d'_' -f1 | cut -c4-)")
                #Get part count
                part_count=$(grep -o "_" <<<"$f" | wc -l)

            fi
        done
    else
        echo "ERROR:"
    fi
}
infinity
  • You have to be more specific. What should the output of the function be? Why do you run it at all? What are variables like `part_count` or `prop2` calculated for? – KamilCuk Dec 12 '19 at 11:16
  • The real function is really bigger and all of these variables have meaning there. But even this function is slower than a simple find. – infinity Dec 12 '19 at 11:31
  • IMHO there is not enough information in the question to provide a specific answer. Consider sharing more information, or focus your question on speeding up a specific part that you can share. – dash-o Dec 12 '19 at 13:49
  • Pipe the output of your `find` command into **GNU Parallel** like this... https://stackoverflow.com/a/45032643/2836621 Be more explicit about what your functions and pyramid checks are - they can probably be improved too. – Mark Setchell Dec 12 '19 at 13:53

3 Answers


You can fork and run in parallel on multiple files.

  • Can you be more specific? – infinity Dec 12 '19 at 11:17
  • Yeah! Sorry about that. You could wrap your tests in one big function. Afterwards, determine the level of parallelism your machine is capable of. Say you are capable of running n processes in parallel. Write a loop that will run (20 000 000 / n) times, forking the test function you created n times per iteration (a rough sketch of this idea follows below). – user12524388 Dec 12 '19 at 11:23
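
A minimal sketch of the approach described in the comment above. The check_file wrapper is hypothetical (it stands in for the validation functions from the question), and the batch size is taken from nproc:

#!/usr/bin/env bash
# Hypothetical wrapper around the per-file tests (lenght, multi, removeZeros, ...)
check_file() {
    local file="$1"
    # ... run the validation functions on "$file" here ...
}

dir="$1"                  # directory to scan
n=$(nproc)                # level of parallelism the machine can handle
i=0
while IFS= read -r -d '' file; do
    check_file "$file" &              # fork one check per file
    if (( ++i % n == 0 )); then
        wait                          # wait for the current batch of n jobs to finish
    fi
done < <(find "$dir" -type f -print0)
wait                                  # wait for the last, possibly partial batch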

The general rule is: the less you do, the faster it is. The fewer processes you run, the better - every external command and every $( ... ) command substitution is another process.

I could do:

lenght() {
    # rewrite to return nonzero on error
    sed '/^.\{15\}$/!q1'
}

# same with multi
multi() {
    return 1
}

initDatabase() {
    # the `$@` in this context is the same as `$*`
    dir="$*"
    # quote your variables
    # Use bashs [[ instead of [
    if [[ -z "$dir" ]]; then
       echo "ERROR"
       return
    fi

    initDatabaseCallback() {
        # by convention, only exported names should be upper case
        local file
        file="$1"

        # the (most probably) useless error/out bookkeeping variables are removed
        if ! out=$(lenght); then
            # note to other programmers that these are global variables
            declare -g NC BLUE RED

            echo -e "${NC}${BLUE}Fail on vigane"
            echo -e "${RED}$out${NC}"
            echo "erro" >> "$LOG_FILE_NAME"
            echo "$out" >> "$LOG_FILE_NAME"

            # I guess this means something failed
            # see man xargs for what to return here
            return 1
        fi
        # the useless error=0 assignment and check are removed

        f=${file##*/}
        f_dir="${file%/*}"
        # quote your variables
        changed=$(stat -c%Y "$file")

        ## checks if the file is a pyramid tiff
        # quote your variables
        if multi "$file"; then
            pyramid=true
        else
            pyramid=false
        fi

        #FUNCTION removes zeros from the beginning
        # you mean sed 's/^0*//'?
        # use a bash here-string instead of another process
        prop2=$(removeZeros "$(cut -d'_' -f1 <<<"$f" | cut -c4-)")
        #Get part count
        part_count=$(grep -o "_" <<<"$f" | wc -l)
    }
    export -f initDatabaseCallback

    # quote your variables
    find "$dir" -type f -print0 |
    # manipulate number of processes depending on your specific case
    xargs -0 -n1 -P$(nproc) bash -c 'initDatabaseCallback "$@"' --
}

That written, I don't like the idea of so many variables and assignments. In my opinion the shell works best as a pipe - a collection of programs where one program takes the output of another, parses it and passes it on to the next program. Most probably functions like lenght could be rewritten as a single sed script that works on a stream of filenames separated by newlines. The stat could be integrated with find -printf, saving one process per file. I guess grep -o | wc -l could be grep -c, but I don't know if lines or occurrence counts matter.
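
A rough sketch of that pipeline idea, assuming GNU find, using awk instead of sed for the per-name processing, and treating "exactly 15 characters" as a made-up stand-in for the real lenght rules (file names must not contain tabs or newlines):

dir="$1"    # directory to scan
# One find, one awk: find prints path, name and mtime per line; awk does the
# name validation, zero stripping and '_' counting without forking anything else.
find "$dir" -type f -printf '%p\t%f\t%T@\n' |
awk -F'\t' '{
    file = $1; name = $2; changed = $3
    if (length(name) != 15) {          # stand-in for the real lenght rules
        print "invalid name: " name
        next
    }
    n = split(name, parts, "_")
    part_count = n - 1                 # number of "_" in the name
    prop2 = substr(parts[1], 4)        # cut -c4- on the part before the first "_"
    sub(/^0+/, "", prop2)              # removeZeros
    print file, changed, prop2, part_count
}'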

KamilCuk

To put it very simply, the implementation is not very efficient: it will fork/exec multiple processes per file. Look at every command that is executed per file and see if it can be implemented without external processes (a sketch follows the list):

  • lenght
  • stat
  • multi
  • removeZeros
  • grep
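
For example, most of the per-file work from the question can be done with bash parameter expansion and arithmetic alone. A hedged sketch: the 15-character rule and the sample file name are made up, since the real lenght/removeZeros rules are not shown:

shopt -s extglob                        # for the +(0) pattern below

FILE=/data/img/ABC0042_001.tif          # hypothetical example path
f=${FILE##*/}                           # basename without forking
f_dir=${FILE%/*}                        # dirname without forking

# lenght: e.g. a pure-bash length check instead of an external call
if (( ${#f} != 15 )); then
    echo "invalid name: $f"
fi

# removeZeros: strip leading zeros with parameter expansion
prefix=${f%%_*}                         # part before the first '_' (cut -d'_' -f1)
prop2=${prefix:3}                       # drop the first 3 characters (cut -c4-)
prop2=${prop2##+(0)}                    # remove leading zeros

# part count: count '_' without grep | wc
no_underscores=${f//_/}
part_count=$(( ${#f} - ${#no_underscores} ))

# mtime: let find print it for all files at once instead of one stat per file:
#   find "$dir" -type f -printf '%p\t%T@\n'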

If there is no internal bash command that can perform the task, look at batching the work instead, so that one external process handles many files rather than one process being forked per file.

If the above approach does not make processing efficient enough, consider using a more flexible alternative with stronger built-in processing. It is hard to recommend one, since the post does not include information about the user-defined functions (lenght, multi, ...).

dash-o