
Though I'm not totally new to regular expressions, they always give me headaches, especially when not every regex flavor can be used.

  • The pattern has to work with pdfgrep, as the information I'm looking for is inside a PDF document.
  • The document is, of course, multiline.
  • The resulting pattern will be used in a bash script, in case that makes any difference.
  • The keywords usually occur more than once in the same file, but I only need the data between the first occurrences of both keywords.

The data looks like:

some text
some more text
even more information Date
                      02.Feb.2014
                      Customer
some more text
some more information
even more information Date
                      02.Feb.2014
                      Customer
some more text
some more information
...

The result of the command should be: 02.Feb.2014

I don't know which characters might be around this date (tabs, spaces ...) and I don't want to rely on them.

I tried

pdfgrep -h 'Date(.*?)Customer' *.pdf

which gave no result at all.

Next try was

pdfgrep -h '(?<=Date)(.*)(?=Customer)' *.pdf

which resulted in an error "Invalid preceding regular expression"

The best I have come up with so far is

pdfgrep -h '(Date)[[:space:]]{,1}.{,100}[[:space:]](Customer){,1}' *.pdf

This returns every matching date together with the first keyword. I'd like a much more elegant solution, though, which regular expressions should be able to provide.

I'd appreciate any useful hint ;)

Regards

Manuel

2 Answers


The only document you should ever read when using grep, awk, or sed regular expressions is here. It cleared a lot of stuff up for me.

sed -n -e '/even more information Date/ {' \
       -e '    n' \
       -e '    s/^[[:space:]]*//' \
       -e '    p' \
       -e '}'

Classic UNIX tools such as grep and sed apply a regular expression to one line at a time, so by default you can't capture anything across lines with a single pattern.

The sed command above looks for a line matching even more information Date, moves on to the next line (n), strips the leading whitespace, and prints that line (the one with 02.Feb.2014 on it). The -n option suppresses sed's automatic output, so only lines printed explicitly with p appear.
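Since the question asks only for the first occurrence, the same idea can be tried on plain text before any PDF is involved. This is a minimal sketch, not a tested solution for real PDFs: the sample text is made up to mirror the question, first_date_line is a hypothetical helper name, and the added q command (quit after the first match) is my assumption about what "first occurrence only" should mean here.

```shell
# Sample text mirroring the layout shown in the question (an assumption
# about what the extracted PDF text looks like).
sample='some text
even more information Date
                      02.Feb.2014
                      Customer
some more text
even more information Date
                      02.Feb.2014
                      Customer'

# first_date_line (hypothetical helper): find the first line containing
# "Date", read the next line, strip its leading whitespace, print it,
# and quit so that later occurrences are ignored.
first_date_line() {
    sed -n -e '/Date/ {' \
           -e '    n' \
           -e '    s/^[[:space:]]*//' \
           -e '    p' \
           -e '    q' \
           -e '}'
}

result=$(printf '%s\n' "$sample" | first_date_line)
echo "$result"
```

Without the q, sed would keep reading and print the date once for every Date line in the input.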

djhaskin987
  • So I have to extract the whole PDF with pdfgrep and then I can pipe it through sed? Ok, I'll try this but it doesn't seem to be very efficient. I tried your code, which I had to adjust a bit, but it just finds 'Date' and nothing else. The command I used was: pdfgrep -h '.' *.pdf | sed -n -e '/Date/ {' -e 'n' -e 's/^[[:space:]]*//' -e 'p' -e '}' – user3848598 Jul 22 '14 at 11:01
  • Honestly, if it was me, I'd convert the PDF to text and then use python or ruby to get what I needed. See this answer: http://stackoverflow.com/questions/6451626/how-do-i-convert-a-pdf-to-text-so-i-can-parse-that-text-with-php – djhaskin987 Jul 22 '14 at 20:42
  • XPDF won't work, though: this server has no GUI and never will. – user3848598 Jul 28 '14 at 09:43
  • It looks like the ghostscript extraction method is pretty good. -- http://stackoverflow.com/questions/6187250/pdf-text-extraction – djhaskin987 Jul 28 '14 at 15:53
0

The hint to use gs in combination with sed did the trick, though it took some testing until it worked as desired.

The command used now is:

gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=1 \
      -sOutputFile=- /path/to/my.pdf 2>/dev/null | sed -n -e '/Date/ {' \
      -e 'n' -e 's/^[[:space:]]*//' -e 'p' -e '}'
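If the date does not always sit exactly one line below the keyword, a variant that prints whatever lies between the first Date and the first Customer might be more robust. The following is only a sketch under assumptions: it expects text that has already been extracted (for example by the gs command above), between_first is a hypothetical helper name, and the sample input is made up to mirror the question.

```shell
# between_first (hypothetical helper): print the non-empty lines found
# between the first line containing "Date" and the next line containing
# "Customer", trimmed of surrounding spaces and tabs, then stop reading.
between_first() {
    awk '
        found && /Customer/ { exit }          # closing keyword: done
        found {
            gsub(/^[ \t]+|[ \t]+$/, "")        # trim both ends
            if ($0 != "") print
        }
        /Date/ { found = 1 }                  # opening keyword seen
    '
}

# Made-up sample input; in practice this would be the extracted PDF text.
extracted=$(printf '%s\n' \
    'even more information Date' \
    '                      02.Feb.2014' \
    '                      Customer' \
    'even more information Date' \
    '                      02.Feb.2014' \
    '                      Customer' | between_first)
echo "$extracted"
```

Because awk exits at the first Customer line after a Date line, repeated blocks later in the text are never reached.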

Thanks to all contributors :)