Though I'm not totally new to regular expressions, they always give me headaches, especially when not every regex flavor can be used.
- The pattern has to work with pdfgrep, as the information I'm looking for is inside a PDF document.
- Obviously, the document spans multiple lines.
- The resulting pattern will be used in a bash script, in case that makes any difference.
- The keywords can usually be found more than once in the same file, but I need only the data between the first occurrences of both keywords.
The data looks like:
some text
some more text
even more information Date
02.Feb.2014
Customer
some more text
some more information
even more information Date
02.Feb.2014
Customer
some more text
some more information
...
The result of the command should be: 02.Feb.2014
I don't know which characters might be around this date (tabs, spaces ...) and I don't want to rely on them.
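To make the goal concrete, here is a minimal sketch of the behaviour I'm after, run on plain text instead of a PDF. It is not a pdfgrep solution: it assumes the text has already been extracted into lines like the sample above and that the "Date" keyword ends its line. The function name first_date is made up for illustration.

```shell
# Sample text mirroring the data above (assumption: PDF text already extracted).
sample='some text
some more text
even more information Date
02.Feb.2014
Customer
some more text
some more information
even more information Date
02.Feb.2014
Customer
some more text'

# Hypothetical helper: print the non-empty lines between the first "Date"
# and the first following "Customer", trimmed of tabs and spaces.
first_date() {
  awk '
    found && /Customer/ { exit }        # closing keyword reached: stop reading
    found {
      gsub(/^[ \t]+|[ \t]+$/, "")       # trim surrounding whitespace
      if ($0 != "") print
    }
    /Date[ \t]*$/ { found = 1 }         # opening keyword at the end of a line
  '
}

result=$(printf '%s\n' "$sample" | first_date)
echo "$result"
```

Because awk exits at the first "Customer" after the first "Date", the second occurrence of both keywords is never read.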
I tried
pdfgrep -h 'Date(.*?)Customer' *.pdf
which gave no result at all.
Next try was
pdfgrep -h '(?<=Date)(.*)(?=Customer)' *.pdf
which resulted in the error "Invalid preceding regular expression".
The best I have come up with so far is
pdfgrep -h '(Date)[[:space:]]{,1}.{,100}[[:space:]](Customer){,1}' *.pdf
This returns every matching date together with the first keyword, whereas I need only the first date, without the keyword. I'd also like a much more elegant solution, which regular expressions should be able to provide.
I'd appreciate any useful hint ;)
Regards
Manuel