I would like to write a script that runs a command to OCR PDFs and deletes the resulting images once the text files have been written.
The two commands I want to combine are the following.
This command creates a folder for each PDF, extracts the pages as pgm images, and puts them into that folder:
time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'
This command does the OCR and deletes the resulting images (pgm):
time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
I would like to combine both commands so that the script deletes the pgm images right after each OCR run. If I run the commands as they are, the first extracts the images for all PDFs and eats up my disk space, and only then does the second command do the OCR and delete the images as a last step.
So, the steps are:
- Create folder
- Extract PGM from PDF
- OCR from PGM to txt
- Delete the PGM images that have just been used (this step is missing)
Basically, I would like these 4 steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?
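To make the intended per-PDF order concrete, here is a rough sketch of the four steps wrapped in one shell function. The function name ocr_pdf is made up for illustration, and the gs and tesseract options are simply copied from the two commands above; it assumes every input file name ends in .pdf and has not been tested against real scans.

```shell
#!/usr/bin/env bash
# Sketch: handle ONE PDF from start to finish, so page images never pile up.
# ocr_pdf is a hypothetical helper name, not part of any of the tools used.
ocr_pdf() {
    local pdf="$1"
    local dir="${pdf%.pdf}"      # ./scans/book.pdf -> ./scans/book
    local base="${dir##*/}"      # ./scans/book     -> book
    local img

    # Steps 1 and 2: create the folder and render each page as a PGM image.
    mkdir -p "$dir" || return 1
    gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen \
       -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 \
       -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray \
       -dOverrideICC -o "$dir/$base-%03d.pgm" "$pdf" || return 1

    # Steps 3 and 4: OCR each page, then delete its image only if OCR succeeded.
    for img in "$dir"/*.pgm; do
        [ -e "$img" ] || continue    # nothing was rendered for this PDF
        tesseract "$img" "${img%.pgm}" -l deu_frak && rm -- "$img"
    done
}

# Possible invocation with GNU parallel, one job per PDF (bash only, because
# parallel needs the function exported into its environment):
#   export -f ocr_pdf
#   find . -name '*.pdf' | parallel -j 4 --progress ocr_pdf {}
```

With this layout, at most four PDFs' worth of images exist on disk at any time (one per parallel job), instead of the images for every PDF.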
Edit:
My first attempt at solving this was the following command:
time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
However, tesseract would not find the language package.