I would like to use a regular expression to extract the first two words and the second-to-last letter of a string.
For example, from the string

                       "CSC 101 Intro to Computing  A  R"

I would like to capture

                        "CSC 101 A"

Maybe something similar to this:

                 grep -o -P '\w{3}\s\d{3}*thenIdon'tKnow*\s\w\s'

Any help would be greatly appreciated.

Jan
IBWEV

3 Answers

You could go for:

^((?:\w+\W+){2}).*(\w+)\W+\w+$

Then combine groups 1 and 2 (group 1 keeps its trailing whitespace); see it working on regex101.com.


Broken down, this says:
^                 # match the start of the line/string
(                 # capture group 1
    (?:\w+\W+){2} # repeated non-capturing group with words/non words
)
.*                # anything else afterwards
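(\w+)\W+\w+       # second-to-last word, separator, last word; because .* is greedy,
                  # group 2 holds only the final character of that word
$                 # match the end of the line/string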
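
To try this from the command line, here is a minimal sketch, assuming GNU sed (needed for \w and \W). Extended regular expressions have no (?:...) syntax, so the repeated group is written as an ordinary group; it becomes \2 and the captured letter moves to \3. The variable names are only illustrative.

```shell
line='CSC 101 Intro to Computing  A  R'

# \1 = first two words with their trailing separator,
# \2 = the repeated inner group (unused), \3 = the captured letter.
result=$(printf '%s\n' "$line" | sed -E 's/^((\w+\W+){2}).*(\w+)\W+\w+$/\1\3/')

echo "$result"   # CSC 101 A
```

No extra space is needed between \1 and \3 because group 1 already ends with the separator.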
Jan

A single regex match is always one contiguous span of text, so one match cannot return two disjoint pieces.

I suggest taking a look at capture groups: capture the two disjoint pieces in separate groups, then build the output by referring to those groups.

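grep cannot print individual capture groups, so here is an example with sed:
echo 'CSC 101 Intro to Computing A R' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p'
which prints CSC 101 A.
Note that the pattern used here corresponds to ^(\w{3}\s\d{3}).*?(\w)\s+\w$ in PCRE syntax. sed has no lazy quantifiers: in GNU sed's basic syntax \? means "zero or one", so .*\? still matches greedily. The result is correct anyway because the pattern is anchored with $.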
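
For use inside a script, the same substitution may be easier to read in extended syntax. This is a sketch assuming GNU sed; extract_course is a hypothetical helper name, not something from the question.

```shell
# extract_course (hypothetical name) reads lines on stdin and prints
# "<3-letter code> <3-digit number> <second-to-last letter>" for lines that match.
extract_course() {
  # -n with the p flag prints only lines where the substitution succeeded.
  sed -nE 's/^(\w{3}\s[[:digit:]]{3}).*(\w)\s+\w$/\1 \2/p'
}

echo 'CSC 101 Intro to Computing A R' | extract_course   # CSC 101 A
```

Lines that do not start with a three-character code followed by three digits produce no output, which makes the helper safe to use as a filter.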

Graham
Phu Ngo
  • Thank you for the help. The command you have written works nicely at the command prompt. However, it is not working in my script – IBWEV Oct 24 '16 at 12:58
  • pcregrep -M 'LSUS(\n|.)*?of' $f | grep -P '(\w{3}|\w{4})\s\d{3}\s.{23}\s+[ABCD]' | grep -o -P '(?<=).{40}(?=)' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\w\s\+\(\w\)$/\1 \2/p' – IBWEV Oct 24 '16 at 12:59
  • What does `pcregrep -M 'LSUS(\n|.)*?of' $f | grep -P '(\w{3}|\w{4})\s\d{3}\s.{23}\s+[ABCD]' | grep -o -P '(?<=).{40}(?=)'` print out? You may want to modify the sed part to `sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\?\(\w\)$/\1 \2/p'` to match your format – Phu Ngo Oct 24 '16 at 13:19
  • It prints out "CSC 101 Intro to Computing A R" . I would like to capture "CSC 101 A" . I am close to understanding the regular expression you have following sed but what does "\?" do? – IBWEV Oct 24 '16 at 13:35
  • I got it because of your help. sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\(\w\)/\1 \2/p' Thank you. – IBWEV Oct 24 '16 at 13:40

Do:

^(\S+)\s+(\S+).*(\S+)\s+\S+$
  • The 3 captured groups capture the 3 desired portions

  • \S indicates any non-whitespace character

  • \s indicates any whitespace character


As you have used grep with PCRE in your example, I am assuming you have access to the GNU toolset. Using GNU sed:

% sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/' <<<"CSC 101 Intro to Computing  A  R"
CSC 101 A
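
One caveat worth checking against your data: because .* is greedy, the third group receives only the last character of the second-to-last field. That is exactly what is wanted for a single letter, but if that field can be longer, requiring a whitespace character before the group captures it whole. A small sketch, again assuming GNU sed, with a made-up input line:

```shell
# Hypothetical input where the second-to-last field has two characters.
line='CSC 101 Intro to Computing  AB  R'

# Pattern from above: .* swallows everything up to the last character of "AB".
greedy=$(printf '%s\n' "$line" | sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/')

# Variant with \s before the group: the whole field is captured.
whole=$(printf '%s\n' "$line" | sed -E 's/^(\S+)\s+(\S+).*\s(\S+)\s+\S+$/\1 \2 \3/')

echo "$greedy"   # CSC 101 B
echo "$whole"    # CSC 101 AB
```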
heemayl