Sorry for the nth simple question about regexps, but I can't get what I need without what seems to me an overly complicated solution. I'm parsing a file containing sequences made up of only three letters, A, E, and D, as in

AADDEEDDA

EEEEEEEE

AEEEDEEA

AEEEDDAAA

and I'd like to identify only those that start with E and end with D, with only one change in the sequence, as for example in

EDDDDDDDD

EEEDDDDDD

EEEEEEEED

I'm fighting with the proper regexp to do that. Here is my last attempt:

echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'

which does not work. Any help?

Thanks in advance.

3 Answers

If I understand your request correctly, a simple

awk '/^E+D+$/' file.input

will do the trick.

UPDATE: if the line format contains leading and trailing numbers (the trailing one optional), as shown later in the example, this is a possible pure-regex adaptation (an alternative to using the field-separator switch -F,):

awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
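To check the two patterns above, here is a minimal sketch that feeds them sample lines instead of a file. The bare sequences come from the question; the record `4,EEDD` and the function names `match_bare` and `match_csv` are made up for illustration.

```shell
# Bare sequences, one per line, taken from the question.
match_bare() {
  printf '%s\n' AADDEEDDA EEEEEEEE AEEEDEEA EDDDDDDDD EEEDDDDDD EEEEEEEED |
    awk '/^E+D+$/'
}

# Comma-separated records; '4,EEDD' is a hypothetical record that
# exercises the optional trailing number.
match_csv() {
  printf '%s\n' '1,AAEDDEED,1' '2,EEEEDDDD,2' '3,EDEDEDED' '4,EEDD' |
    awk '/^[0-9]+,E+D+(,[0-9]+)?$/'
}

match_bare   # prints the three sequences that are a run of Es followed by a run of Ds
match_csv    # prints records 2 and 4
```

Only lines consisting of a run of Es followed by a run of Ds get through; EEEEEEEE is rejected because D+ needs at least one D.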
Giuseppe Ricupero
First of all, you need the regular expression:

^E+[^ED]*D+$

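This matches one or more Es at the beginning, zero or more characters that are neither E nor D in the middle, and one or more Ds at the end. Note that with the letters A, E, and D, the middle part lets a run of As through (for example EEAADD); if such lines count as more than one change, drop the [^ED]* part.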

Then your AWK program will look like

$2 ~ /^E+[^ED]*D+$/

$2 refers to the 2nd field of the current record, ~ is the regex matching operator, and /s delimit a regular expression. Together, these components form what is known in AWK jargon as a "pattern", which amounts to a boolean filter for input records. Note that there is no "action" (a series of statements in {s) specified here. That's because when no action is specified, AWK assumes that the action should be { print $0 }, which prints the entire line.
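As a rough sketch of how this pattern-only program behaves on the question's comma-separated sample, the snippet below runs it with -F, so that $2 is the sequence. Record 4 and the function names `sample`, `with_middle`, and `strict` are made up for illustration. The second function assumes that a run of As between the Es and the Ds should count as an extra change, which is one reading of the question.

```shell
# Records in the question's id,sequence,id layout; record 4 is a
# hypothetical example with As between the E run and the D run.
sample() {
  printf '%s\n' '1,AAEDDEED,1' '2,EEEEDDDD,2' '3,EDEDEDED' '4,EEAADD,4'
}

# Pattern-only program: no action block, so awk prints each matching record.
with_middle() { sample | awk -F, '$2 ~ /^E+[^ED]*D+$/'; }

# Stricter variant without the middle character class.
strict() { sample | awk -F, '$2 ~ /^E+D+$/'; }

with_middle   # prints records 2 and 4
strict        # prints record 2 only
```

The difference between the two outputs is exactly the record with As in the middle, so the choice depends on whether such lines should be accepted.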

shadowtalker
If I understand you correctly, you want to match patterns that start with at least one E and then continue with at least one D until the end.

echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E+D+$/) print $0}'
Roger Lindsjö