RegEx skip word

Question

I would like to use regular expressions to extract the first couple of words and the second to last letter of a string.
For example, in the string

                       "CSC 101 Intro to Computing  A  R"

I would like to capture

                        "CSC 101 A"

Maybe something similar to this

                 grep -o -P '\w{3}\s\d{3}*thenIdon'tKnow*\s\w\s'

Any help would be greatly appreciated.

score 1 · Answer 1 · answered Oct 24 '16 at 08:00

You could go for:

^((?:\w+\W+){2}).*(\w+)\W+\w+$

Then use groups 1 and 2 (group 1 keeps its trailing whitespace, so the two can be concatenated directly); see it working on regex101.com.

Broken down, this says:

^                 # match the start of the line/string
(                 # capture group 1
    (?:\w+\W+){2} # repeated non-capturing group with words/non words
)
.*                # anything else afterwards
(\w+)\W+\w+       # group 2: once .* backtracks, the last character of the second-to-last word
$                 # match the end of the line/string
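
For the GNU tools mentioned in the comments, one possible way to print only the captured parts is a sed substitution. This is a minimal sketch assuming GNU sed (`-E`, `\w` and `\W` are GNU extensions). sed's extended syntax has no `(?:...)`, so the repeated group becomes capturing and the letter lands in group 3 rather than group 2. The function name `first_two_and_letter` is hypothetical.

```shell
# Sketch assuming GNU sed. ERE has no non-capturing groups, so the
# repeated inner group is \2 and the second-to-last letter is \3.
# -n plus the p flag prints only lines where the substitution matched.
first_two_and_letter() {
    printf '%s\n' "$1" |
        sed -nE 's/^((\w+\W+){2}).*(\w+)\W+\w+$/\1\3/p'
}

first_two_and_letter 'CSC 101 Intro to Computing  A  R'
```

Group 1 already ends with the separator after "101", so `\1\3` needs no extra space between the two references.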

answered Oct 24 '16 at 08:00

Jan


Thank you for the help. The regular expression works but how do I only print group 1 & 2? – IBWEV Oct 24 '16 at 12:55
@IBWEV: Which programming language are you using? – Jan Oct 24 '16 at 12:56
I am using pcregrep, grep, and possibly sed of the GNU utilities. – IBWEV Oct 24 '16 at 13:38

score 0 · Answer 2 · edited Sep 24 '17 at 19:58

A single regex match is one contiguous span, so the pattern as a whole can't match disjoint pieces of the string.

I suggest taking a look at capture groups: you capture the two disjoint pieces separately and then refer to them by group number.

grep can't print multiple capture groups, so here is an example with sed:

echo 'CSC 101 Intro to Computing A R' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p'

which prints CSC 101 A.
Note that the pattern used here is ^(\w{3}\s\d{3}).*?(\w)\s+\w$, written with the backslash escaping that sed's basic regex syntax requires.

edited Sep 24 '17 at 19:58

Graham


answered Oct 24 '16 at 05:41

Phu Ngo


Thank you for the help. The command you have written works nicely at the command prompt. However, it is not working in my script – IBWEV Oct 24 '16 at 12:58
pcregrep -M 'LSUS(\n|.)*?of' $f | grep -P '(\w{3}|\w{4})\s\d{3}\s.{23}\s+[ABCD]' | grep -o -P '(?<=).{40}(?=)' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\w\s\+\(\w\)$/\1 \2/p' – IBWEV Oct 24 '16 at 12:59
What does `pcregrep -M 'LSUS(\n|.)*?of' $f | grep -P '(\w{3}|\w{4})\s\d{3}\s.{23}\s+[ABCD]' | grep -o -P '(?<=).{40}(?=)'` print out? You may want to modify the sed part to `sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\?\(\w\)$/\1 \2/p'` to match your format – Phu Ngo Oct 24 '16 at 13:19
It prints out "CSC 101 Intro to Computing A R" . I would like to capture "CSC 101 A" . I am close to understanding the regular expression you have following sed but what does "\?" do? – IBWEV Oct 24 '16 at 13:35
I got it because of your help. sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\(\w\)/\1 \2/p' Thank you. – IBWEV Oct 24 '16 at 13:40
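
On the `\?` question above: in GNU sed's basic regex syntax, `\?` is a quantifier meaning "zero or one of the preceding item". It is not the lazy modifier that `?` becomes after `*` in PCRE, since sed has no lazy quantifiers. The sketch below, assuming GNU sed, suggests that `.*\?` and plain `.*` give the same result for this input; the `$` anchor pins the tail of the match, so laziness would not change the outcome anyway. The function names are hypothetical.

```shell
# Assumes GNU sed; \? and \+ are GNU extensions to basic regex syntax.
with_question() {
    printf '%s\n' "$1" |
        sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p'
}

without_question() {
    printf '%s\n' "$1" |
        sed -n 's/^\(\w\{3,4\}\s[[:digit:]]\{3\}\).*\(\w\)\s\+\w$/\1 \2/p'
}

line='CSC 101 Intro to Computing A R'
with_question "$line"      # prints: CSC 101 A
without_question "$line"   # prints: CSC 101 A
```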

heemayl · Answer 3 · 2016-10-24T13:17:31.000

Do:

^(\S+)\s+(\S+).*(\S+)\s+\S+$

The 3 capture groups hold the 3 desired portions
\S indicates any non-whitespace character
\s indicates any whitespace character

Demo

As you have used grep with PCRE in your example, I am assuming you have access to the GNU toolset. Using GNU sed:

% sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/' <<<"CSC 101 Intro to Computing  A  R"
CSC 101 A
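
To print only the captured groups and drop lines that do not match, one option is to combine `-n` with the `p` flag on the substitution. A minimal sketch, assuming GNU sed (`\S` and `\s` are GNU extensions); the function name `print_groups` and the sample header line are hypothetical.

```shell
# Sketch assuming GNU sed: -n silences the default output, and the
# p flag prints a line only when the substitution succeeded on it.
print_groups() {
    sed -nE 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/p'
}

# The one-word header line does not match and is therefore dropped.
printf '%s\n' 'LSUS' 'CSC 101 Intro to Computing  A  R' | print_groups
```

Without `-n` and `p`, sed would also echo every non-matching line unchanged.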

edited Oct 24 '16 at 13:17

answered Oct 24 '16 at 05:45

heemayl


How do I only print the captured groups? – IBWEV Oct 24 '16 at 13:00
