
Right now I have something like this. This function is part of a Bash script file. Inside this function I call many custom functions, nothing too complicated. For example, lenght just checks the file name against string rules. Every function that I add makes the script much slower. Tested on 300 files: a simple find that just echoes the file name takes less than a second; with all the functions it takes 0h:0m:11s. I know there is not enough info, but still, how can I make this faster?

In production I have to loop over 20 million files.

function initDatabase {

    dir="$@"
    # check dir is not empty
    if [ ! -z $dir ]
    then
        find $dir -type f -print0 | while IFS= read -r -d '' FILE
        do
            error=0
            out=''

            #FUNCTION  validates file name
            out=$(lenght)

            if [ ! -z "$out" ]
            then 

                echo -e "${NC}${BLUE}Fail on vigane"
                echo -e "${RED}$out${NC}"
                echo "erro" >> $LOG_FILE_NAME
                echo "$out" >> $LOG_FILE_NAME
                error=1
            fi


            if [ $error == 0 ]
            then
                #get file name and directory
                f=${FILE##*/}
                f_dir="${FILE%/*}"
                changed=$(stat -c%Y $FILE)

                ## checks if file is pyramid tiff
                pyramid="false"
                out="$(multi $FILE)"

                if [ "$out" == 1 ]; then pyramid="true"; fi
                #FUNCTION removes zeros from beginning
                prop2=$(removeZeros "$(echo $f | cut -d'_' -f1 | cut -c4-)")
                #Get part count
                part_count=$(grep -o "_" <<<"$f" | wc -l)

            fi
        done
    else
        echo "ERROR:"
    fi
}
infinity
  • You have to be more specific. What should the output of the function be? Why do you run it at all? What are variables like `part_count` or `prop2` calculated for? – KamilCuk Dec 12 '19 at 11:16
  • The real function is really bigger and all of these variables have meaning there. But even this function is slower than a simple find. – infinity Dec 12 '19 at 11:31
  • IMHO there is not enough information in the question to provide a specific answer. Consider sharing more information, or focus your question on speeding up a specific part that you can share. – dash-o Dec 12 '19 at 13:49
  • Pipe the output of your `find` command into **GNU Parallel** like this... https://stackoverflow.com/a/45032643/2836621 Be more explicit about what your functions and pyramid checks are - they can probably be improved too. – Mark Setchell Dec 12 '19 at 13:53

3 Answers


You can fork and run in parallel on multiple files.

  • Can you be more specific? – infinity Dec 12 '19 at 11:17
  • Yeah! Sorry about that. You could wrap your tests in one big function. Afterwards, determine the level of parallelism your machine is capable of. Say you are capable of running n processes in parallel. Write a loop that will run (20 000 000 / n) times, forking the test function you created n times per iteration (a rough sketch of this idea follows below). – user12524388 Dec 12 '19 at 11:23
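
A minimal sketch of the approach described in the comment above. The check_file wrapper is hypothetical (it stands in for the validation functions from the question), and the batch size is taken from nproc:

#!/usr/bin/env bash
# Hypothetical wrapper around the per-file tests (lenght, multi, removeZeros, ...)
check_file() {
    local file="$1"
    # ... run the validation functions on "$file" here ...
}

dir="$1"                  # directory to scan
n=$(nproc)                # level of parallelism the machine can handle
i=0
while IFS= read -r -d '' file; do
    check_file "$file" &              # fork one check per file
    if (( ++i % n == 0 )); then
        wait                          # wait for the current batch of n jobs to finish
    fi
done < <(find "$dir" -type f -print0)
wait                                  # wait for the last, possibly partial batch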

The general rule is: the less you do, the faster it is. The fewer processes you run, the better - every external command and every $( ... ) command substitution is another process.

I could do:

lenght() {
    # rewrite to return nonzero on error
    sed '/^.\{15\}$/!q1'
}

# same with multi
multi() {
    return 1
}

initDatabase() {
    # the `$@` in this context is the same as `$*`
    dir="$*"
    # quote your variables
    # Use bashs [[ instead of [
    if [[ -z "$dir" ]]; then
       echo "ERROR"
       return
    fi

    initDatabaseCallback() {
        # by convention, only exported names should be upper case
        local file
        file="$1"

        # the (most probably) useless error/out bookkeeping variables are removed
        if ! out=$(lenght); then
            # note to other programmers that these are global variables
            declare -g NC BLUE RED

            echo -e "${NC}${BLUE}Fail on vigane"
            echo -e "${RED}$out${NC}"
            echo "erro" >> "$LOG_FILE_NAME"
            echo "$out" >> "$LOG_FILE_NAME"

            # I guess this means something failed
            # see man xargs for what to return here
            return 1
        fi
        # the useless error=0 assignment and check are removed

        f=${file##*/}
        f_dir="${file%/*}"
        # quote your variables
        changed=$(stat -c%Y "$file")

        ## checks if the file is a pyramid tiff
        # quote your variables
        if multi "$file"; then
            pyramid=true
        else
            pyramid=false
        fi

        #FUNCTION removes zeros from the beginning
        # you mean sed 's/^0*//'?
        # use a bash here-string instead of another process
        prop2=$(removeZeros "$(cut -d'_' -f1 <<<"$f" | cut -c4-)")
        #Get part count
        part_count=$(grep -o "_" <<<"$f" | wc -l)
    }
    export -f initDatabaseCallback

    # quote your variables
    find "$dir" -type f -print0 |
    # manipulate number of processes depending on your specific case
    xargs -0 -n1 -P$(nproc) bash -c 'initDatabaseCallback "$@"' --
}

That written, I don't like the idea of so many variables and assignments. In my opinion the shell works best as a pipe - a collection of programs where one program takes the output of another, parses it and passes it on to the next program. Most probably functions like lenght could be rewritten as a single sed script that works on a stream of filenames separated by newlines. The stat could be integrated with find -printf, saving one process per file. I guess grep -o | wc -l could be grep -c, but I don't know if lines or occurrence counts matter.
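
A rough sketch of that pipeline idea, assuming GNU find, using awk instead of sed for the per-name processing, and treating "exactly 15 characters" as a made-up stand-in for the real lenght rules (file names must not contain tabs or newlines):

dir="$1"    # directory to scan
# One find, one awk: find prints path, name and mtime per line; awk does the
# name validation, zero stripping and '_' counting without forking anything else.
find "$dir" -type f -printf '%p\t%f\t%T@\n' |
awk -F'\t' '{
    file = $1; name = $2; changed = $3
    if (length(name) != 15) {          # stand-in for the real lenght rules
        print "invalid name: " name
        next
    }
    n = split(name, parts, "_")
    part_count = n - 1                 # number of "_" in the name
    prop2 = substr(parts[1], 4)        # cut -c4- on the part before the first "_"
    sub(/^0+/, "", prop2)              # removeZeros
    print file, changed, prop2, part_count
}'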

KamilCuk

To put it very simply, the implementation is not very efficient: it will fork/exec multiple processes per file. Look at every command that is executed per file and see if it can be implemented without external processes (a sketch follows the list):

  • lenght
  • stat
  • multi
  • removeZeros
  • grep
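
For example, most of the per-file work from the question can be done with bash parameter expansion and arithmetic alone. A hedged sketch: the 15-character rule and the sample file name are made up, since the real lenght/removeZeros rules are not shown:

shopt -s extglob                        # for the +(0) pattern below

FILE=/data/img/ABC0042_001.tif          # hypothetical example path
f=${FILE##*/}                           # basename without forking
f_dir=${FILE%/*}                        # dirname without forking

# lenght: e.g. a pure-bash length check instead of an external call
if (( ${#f} != 15 )); then
    echo "invalid name: $f"
fi

# removeZeros: strip leading zeros with parameter expansion
prefix=${f%%_*}                         # part before the first '_' (cut -d'_' -f1)
prop2=${prefix:3}                       # drop the first 3 characters (cut -c4-)
prop2=${prop2##+(0)}                    # remove leading zeros

# part count: count '_' without grep | wc
no_underscores=${f//_/}
part_count=$(( ${#f} - ${#no_underscores} ))

# mtime: let find print it for all files at once instead of one stat per file:
#   find "$dir" -type f -printf '%p\t%T@\n'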

If there is no internal bash command that can perform the task, look at batching the work instead, so that one external process handles many files rather than one process being forked per file.

If the above approach does not make processing efficient enough, consider using a more flexible alternative with stronger built-in processing. It is hard to recommend one, since the post does not include information about the user-defined functions (lenght, multi, ...).

dash-o