grep content from pdf file and write it partwise into variables

Question

A short introduction: I have just joined Stack Overflow after searching around for a few days. I'm working on a little project on my RasPi that sorts my PDF documents and gives them descriptive filenames.

I want to use pdfgrep to extract the company name and the date from various documents.

Here is the code:

#!/bin/bash

# set work directory
workpath=~pi/Documents

# IFS= and -r keep leading blanks and backslashes in filenames intact
find "$workpath" -iname '*.pdf' -print | while IFS= read -r FILENAME
do
        if pdfgrep -i --max-count 1 'company1' "${FILENAME}"
        then
                echo "$FILENAME"
                # date written as "dd. Monat yyyy"
                pdfgrep --max-count 1 '(([0-9][0-9]{,1}\.)\s+(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+([1-9][0-9][0-9][0-9]{1,}))' "${FILENAME}"
                echo "company1"
        elif pdfgrep -i --max-count 1 'company2' "${FILENAME}"
        then
                echo "$FILENAME"
                # date written as "Datum: dd.mm.yyyy"
                pdfgrep --max-count 1 '(Datum:)\s+(([0-9][0-9]{,1}\.)([0-9][0-9]{,1}\.)([1-9][0-9][0-9][0-9]{1,}))' "${FILENAME}"
                echo "company2"
        else
                echo "$FILENAME"
                echo "undefined document -- Error!!"
        fi
done

For each file I get different content, such as:

companyname

paper of conduct companyname

companyname and companyaddress

and various other things

The date also comes in different forms:

dd.mm.yyyy

date: dd.mm.yyyy

some text dd. month yyyy

_______________________dd.month yyyy

I'm looking for a way to write only the needed content, without the surrounding text, into variables such as:

comp=companyname

datey=yyyy

datem=mm (here I also need an idea for how to translate the month name to mm)

dated=dd

The result should be: yyyymmdd-companyname.pdf

I started with bash scripting because that is how I got pdfgrep working, and I'm not very familiar with programming languages. I have written maybe a few lines of Python :S

Your help will be very welcome!

cheers, bdream

score 1 · Answer 1 · answered Mar 18 '20 at 16:46

This is not a full solution but a list of hints.

Adding the -o option to the pdfgrep command should print only the matching part of the line, i.e. eliminate surrounding text like "date:" etc. (as long as the pattern itself does not include that text).

pdfgrep -o --max-count 1 '(([0-9][0-9]{,1}\.)\s+('Januar'|'Februar'|'März'|'April'|'Mai'|'Juni'|'Juli'|'August'|'September'|'Oktober'|'November'|'Dezember')\s+([1-9][0-9][0-9][0-9]{1,}))' "${FILENAME}";
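To get the match into a variable instead of onto the terminal, command substitution can wrap the call. The following is only a minimal sketch: the helper name first_match is made up for illustration, and it assumes pdfgrep is installed and that taking the first printed match is what you want.

```shell
#!/bin/bash

# Hypothetical helper (not part of pdfgrep): print only the first matched
# text for a pattern in one PDF. Assumes pdfgrep is on the PATH.
first_match() {
    local pattern=$1 file=$2
    # -o prints just the matched part; head guards against several
    # matches being printed for the same line.
    pdfgrep -o --max-count 1 "$pattern" "$file" | head -n 1
}

# Possible use inside the loop from the question:
#   rawdate=$(first_match '[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}' "$FILENAME")
#   echo "found: $rawdate"
```

The variable then holds text such as "12.03.2020", which still has to be split into day, month and year.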

Since you search for specific company names in

if pdfgrep -i --max-count 1 'company1' "${FILENAME}";

etc., you don't really need the output; you can use your known company name instead. Adding the -q option suppresses the output:

if pdfgrep -q -i --max-count 1 'company1' "${FILENAME}";
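With more than two companies the if/elif chain gets long, so the same test could be run in a loop over the known names. This is a sketch under assumptions: detect_company is a hypothetical helper name, and company1/company2 stand in for your real search terms.

```shell
#!/bin/bash

# Hypothetical helper: print the first known company name that occurs
# in the PDF. Returns non-zero if none of the names is found.
detect_company() {
    local file=$1 name
    shift
    for name in "$@"; do
        # -q: no output, only the exit status tells whether there was a match
        if pdfgrep -q -i "$name" "$file"; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# Possible use inside the loop from the question:
#   comp=$(detect_company "$FILENAME" company1 company2) || comp=undefined
```

This keeps the list of names in one place and sets comp from the name you searched for, not from whatever text surrounds it in the document.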

So the remaining task is to parse the various date formats, which can be done using the strptime function available in Python or Perl, or using the Python dateutil library. See "Parsing a date that can be in several formats in python".
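If you would rather stay in bash, the two date layouts from the question can also be handled with bash's own regex matching. The following is a rough sketch, not a tested solution for your documents: month_to_mm and normalize_date are names invented here, and it assumes the dates look like "dd.mm.yyyy" or "dd. Monat yyyy" with German month names spelled out in full.

```shell
#!/bin/bash

# Hypothetical helper: map a German month name to its two-digit number.
month_to_mm() {
    case "$1" in
        Januar)    echo 01 ;;
        Februar)   echo 02 ;;
        März)      echo 03 ;;
        April)     echo 04 ;;
        Mai)       echo 05 ;;
        Juni)      echo 06 ;;
        Juli)      echo 07 ;;
        August)    echo 08 ;;
        September) echo 09 ;;
        Oktober)   echo 10 ;;
        November)  echo 11 ;;
        Dezember)  echo 12 ;;
        *)         return 1 ;;
    esac
}

# Hypothetical helper: find a date anywhere in the given text and print it
# as yyyymmdd. Returns non-zero if no supported date layout is found.
normalize_date() {
    local raw=$1 dated datem datey
    # dd.mm.yyyy
    local re_num='([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{4})'
    # dd. Monat yyyy (the blank after the dot is optional)
    local re_txt='([0-9]{1,2})\.[[:space:]]*([^[:space:][:digit:]]+)[[:space:]]+([0-9]{4})'
    if [[ $raw =~ $re_num ]]; then
        dated=${BASH_REMATCH[1]}
        datem=${BASH_REMATCH[2]}
        datey=${BASH_REMATCH[3]}
    elif [[ $raw =~ $re_txt ]]; then
        dated=${BASH_REMATCH[1]}
        datem=$(month_to_mm "${BASH_REMATCH[2]}") || return 1
        datey=${BASH_REMATCH[3]}
    else
        return 1
    fi
    # 10# forces base 10 so that "08" and "09" are not read as octal
    printf '%s%02d%02d\n' "$datey" "$((10#$datem))" "$((10#$dated))"
}

# Possible use once the raw date text and the company name are known:
#   stamp=$(normalize_date "$rawdate") && echo "${stamp}-${comp}.pdf"
```

Because the patterns are not anchored, leading text such as "Datum:" or a row of underscores is ignored. Abbreviated month names or two-digit years would need extra cases.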
