I would like to write a script that OCRs PDFs and deletes the intermediate images once the text files have been written.

The two commands I want to combine are the following.

This command creates a folder for each PDF, extracts its pages as PGM images, and puts them into that folder:

time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'

This command runs the OCR and deletes the PGM images afterwards:

time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

I would like to combine both commands so that the PGM images are deleted right after each OCR run. If I run the commands above as they are, the first one extracts the images for all PDFs and eats up my disk space; only then does the second command run the OCR and delete the images.

So,

  1. Create folder
  2. Extract PGM from PDF
  3. OCR from PGM to txt
  4. Delete the PGM images that have just been used (missing)

Basically, I would like these 4 steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?

Edit:

My first attempt to solve my issues was to create the following command:

time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

However, with this command tesseract does not find the language package.

Til Hund

1 Answer

Updated Answer

I have not tested this, so please run it on a copy of a small subset of your files first. Once you are happy that it looks good, you can remove the echo lines that start with DEBUG: to silence the messages:

#!/bin/bash

# Declare a function for "parallel" to call
doit() {
    # Get name of PDF with and without extension
    withext="$1"
    noext="$2"
    echo "DEBUG: Processing $withext into $noext"

    # Make output directory
    mkdir -p "$noext"

    # Extract as PGM into subdirectory (output pattern and input PDF are separate arguments)
    gs ... -o "$noext/${noext##*/}-%03d.pgm" "$withext"

    # Go to target directory or give up on this PDF with an error message
    cd "$noext" || { echo "ERROR: Failed to cd to $noext" >&2; return 1; }

    # OCR and remove each PGM 
    n=0
    for f in *pgm; do
       echo "DEBUG: OCR $f into $n"
       tesseract "$f" "$n" -l deu_frak
       echo "DEBUG: Remove $f"
       rm "$f"
       ((n=n+1))
    done 
}

# Ensure the function is exported to subshells
export -f doit

find . -name \*.pdf -print0 | parallel -0 doit {} {.}

You should be able to test the doit() function without parallel by running:

doit someFile.pdf someFile
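
If the real tools are slow or not installed where you are testing, the control flow can also be checked as a dry run. The following is only a sketch: the `gs` and `tesseract` functions are hypothetical stubs that shadow the real programs, the stub always pretends each PDF has two pages, and this `doit` is a compact variant of the function above rather than the exact same code.

```shell
#!/bin/bash
# Dry run of the per-PDF workflow. gs and tesseract are replaced by
# hypothetical stub functions, so nothing here calls the real tools.

# Stub: pretend the PDF has two pages and create empty PGM files
# from the pattern given after -o (all other options are ignored).
gs() {
    local pattern="" page name
    while [ $# -gt 0 ]; do
        if [ "$1" = "-o" ]; then pattern="$2"; fi
        shift
    done
    for page in 1 2; do
        name=$(printf "$pattern" "$page")
        : > "$name"
    done
}

# Stub: like the real tool, write the result to <outputbase>.txt.
tesseract() {
    echo "stub OCR of $1" > "$2.txt"
}

doit() {
    local withext="$1" noext="$2" f
    mkdir -p "$noext" || return 1
    # A real run would also pass the Ghostscript options from the question.
    gs -o "$noext/${noext##*/}-%03d.pgm" "$withext" || return 1
    for f in "$noext"/*.pgm; do
        [ -e "$f" ] || continue          # no pages were extracted
        # Delete the image only if its OCR step succeeded
        tesseract "$f" "${f%.pgm}" -l deu_frak && rm "$f"
    done
}

workdir=$(mktemp -d)
: > "$workdir/someFile.pdf"              # empty placeholder, never parsed
doit "$workdir/someFile.pdf" "$workdir/someFile"
ls "$workdir/someFile"
```

The final `ls` should list only `.txt` files, since each stubbed PGM is removed right after its OCR step. Unlike the version above, this variant does not `cd`, so the calling shell stays in its current directory.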

Original Answer

If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash function and then call that.

It looks like this:

# Declare a function for "parallel" to call
doit() {
    echo "$1" "$2"
    # mkdir something
    # extract PGM
    # do OCR
    # delete PGM
}

# Ensure the function is exported to subshells
export -f doit

find some files -print0 | parallel -0 doit {} {.}
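
To see what `export -f` buys you without involving parallel at all, here is a minimal sketch. The child `bash -c` stands in for the shell that parallel starts for a job, and the file name with a space is a made-up example to show that the quoted arguments arrive intact.

```shell
#!/bin/bash
# Function that a child shell should be able to call
doit() {
    printf 'with extension: %s | without: %s\n' "$1" "$2"
}

# Export the function so that child bash processes inherit it
export -f doit

# The child bash sees doit; "_" fills $0, then come $1 and $2
out=$(bash -c 'doit "$1" "$2"' _ "my scan.pdf" "my scan")
echo "$out"
```

Without the `export -f` line, the child shell would report that `doit` is not found.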
Mark Setchell
  • Please quote $1 and $2 - otherwise your use of -print0 does not really make a difference: `echo "$1" "$2"`. – Ole Tange Jul 11 '17 at 13:07
  • Added bonus by using a function: It is very easy to test on a single file. – Ole Tange Jul 11 '17 at 13:09
  • Hi Mark Setchell, thanks for proposing a solution. It looks good! Unfortunately, I am not able to make it work by just copying and pasting my commands into your script. Note that I am a beginner in such things. Would you be so kind as to create a workable solution with my commands? That would be fantastic! Imagine you had the PDFs in a folder called **test**. How would you proceed with the script? – Til Hund Jul 11 '17 at 14:25
  • I have had my best shot at a fuller version... please try gently and carefully :-) – Mark Setchell Jul 11 '17 at 16:03
  • Is there an easier way? My main problem is that I cannot pipe input files with two different file extensions to **GNU Parallel**, like `ls *pdf *pgm`, because `tesseract` stops with an error message that it cannot open the PDFs. Why can't I tell parallel to look for other files in the middle of the command, as in my edit above, where I start with `find . -name \*.pdf` and change to `find . -name \*.pgm`? – Til Hund Jul 11 '17 at 21:27