I want to use grep and a regular expression to search a text document. When I type in this:

grep -o ((D|d)ie|(D|d)as|(D|d)e(r|n|m|s)|(ei|Ei)(n|ne|nen|nem|ner|nes)) [A-ZÄÖÜ][A-Za-zäöü]* document.txt

I get this:

-bash: syntax error near unexpected token `('

I already tried to put the regular expression in quotation marks. By doing this, I don't get an error, but I don't find anything either. Thank you for helping me.

For example, the following sentence is in my document:

Der Mann und die Frau haben ein Haus.

I want to extract:

Der Mann
die Frau
ein Haus
fedorqui
bogdan
  • yes, quotes are needed to avoid shell interpretation... as for not finding anything, add sample input and expected output for that along with an explanation of what you are trying to achieve – Sundeep May 12 '17 at 10:45
  • one probable issue is that you are trying to use ERE, in which case you need to use `grep -oE` – Sundeep May 12 '17 at 10:45
  • If you are using the correct locale, you can simply use `'[[:upper:]][[:alpha:]]*'` instead of your explicit bracket expressions. (I assume that including `äöü` without their uppercase counterparts in the second one is an oversight.) – chepner May 12 '17 at 13:10

1 Answer

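Put the pattern in single quotes and enable Extended Regular Expression support with -E. Without the quotes, the shell tries to interpret the parentheses itself, which causes the syntax error. With quotes but without -E, grep reads the pattern as a basic regular expression, in which (, ) and | are literal characters, so nothing matches.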

grep -Eo '((D|d)ie|(D|d)as|(D|d)e(r|n|m|s)|(ei|Ei)(n|ne|nen|nem|ner|nes)) [A-ZÄÖÜ][A-Za-zäöü]*' document.txt

Bear in mind that (D|d) can be written more simply as the bracket expression [Dd]. The same applies to the other parts of your regular expression where you are OR-ing single characters, such as (r|n|m|s), which becomes [rnms].
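
As a rough sketch of how this plays out on the sentence from the question, the snippet below writes that sentence to a temporary file (the variable names `sample`, `bre_hits` and `ere_hits` are only for illustration) and runs grep twice: once quoted but without -E, and once with -E and bracket expressions. The condensed pattern is one possible rewrite, not the only one.

```shell
# Temporary file holding the example sentence from the question.
sample=$(mktemp)
printf '%s\n' 'Der Mann und die Frau haben ein Haus.' > "$sample"

# Quoted but without -E: the pattern is a basic regular expression, so
# ( ) and | are literal characters and nothing in the sentence matches.
# "|| true" only hides grep's non-zero exit status for "no match".
bre_hits=$(grep -o '(D|d)ie [A-Z][a-z]*' "$sample" || true)

# With -E, using bracket expressions instead of single-character alternations.
ere_hits=$(grep -Eo '([Dd](ie|as|e[rnms])|[Ee]in(e[nmrs]?)?) [A-ZÄÖÜ][A-Za-zäöü]*' "$sample" || true)

printf 'without -E: [%s]\n' "$bre_hits"
printf '%s\n' "$ere_hits"

rm -f "$sample"
```

On this input the second command should print the three pairs asked for in the question (Der Mann, die Frau, ein Haus), while the first finds nothing. For text that really contains umlauts, the bracket expressions only behave as intended in a locale that matches the file's encoding.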

As mentioned in the comments, another option to consider is the -i option, which means that the case of the characters is ignored entirely.
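
One thing to keep in mind with -i is that it applies to the whole pattern, so the capital letter required at the start of the noun is relaxed as well. The sketch below uses a made-up sentence with a lowercase "frau" to show the difference. `LC_ALL=C` is set only so that `[A-Z]` means plain ASCII uppercase in this ASCII-only illustration, and the variable names are hypothetical.

```shell
# Made-up test sentence: "frau" is deliberately written in lowercase.
sample=$(mktemp)
printf '%s\n' 'Der Mann sieht die frau.' > "$sample"

# Case-insensitive: the article list can be spelled out in lowercase,
# but [A-Z] now also matches the lowercase "f" in "frau".
ci_hits=$(LC_ALL=C grep -Eoi '(die|das|der|den|dem|des|ein|eine|einen|einem|einer|eines) [A-Z][A-Za-z]*' "$sample" || true)

# Case-sensitive: only an article followed by a capitalized word matches.
cs_hits=$(LC_ALL=C grep -Eo '([Dd](ie|as|e[rnms])|[Ee]in(e[nmrs]?)?) [A-Z][A-Za-z]*' "$sample" || true)

printf 'with -i:\n%s\n' "$ci_hits"
printf 'without -i:\n%s\n' "$cs_hits"

rm -f "$sample"
```

If the capitalized noun is what identifies a match, the case-sensitive form is probably the safer choice, at the cost of a slightly longer pattern.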

Tom Fenech
  • Or simply using `-i` to directly ignore case. – fedorqui May 12 '17 at 10:57
  • Regex could be condensed to `(([Dd](ie|as|e[rnms]))|[Ee]in(|e[nmrs]?)) [A-ZÄÖÜ][A-Za-zäöü]*` – 123 May 12 '17 at 11:15
  • Especially if you use `-i`, it's far simpler to use a longer regular expression. Minimizing its length isn't really going to gain you anything over `(die|das|der|den|dem|des|ein|eine|einen|einem|einer|eines)`. – chepner May 12 '17 at 13:17