Regular Expression to Extract Text Bounded by '/'

Question

I need a regular expression to extract names from a GEDCOM file. The format is:

Fred Joseph /Smith/

Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.

This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:

As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.

Any help would be much appreciated.

Niitaku · Accepted Answer · 2017-02-18T10:25:56.407

For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.

Here is my proposed solution. It doesn't capture spaces around forenames.

^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$

To match the names correctly, take care to use the case-insensitive flag and, if you test all lines at once, the multiline flag.

See the demo

Explanation

^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
- \h* 0 or more horizontal spaces
- ([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
- \h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line
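
For anyone who wants to try this outside regex101, here is a minimal Python sketch of the same pattern. Python's re module does not support \h, so [ \t] is assumed as a stand-in. The function name split_name and the choice to join groups 1 and 3 with a single space are illustrative only and not part of the answer above.

```python
import re

# The pattern above with \h replaced by [ \t] (assumption: Python's re has no \h).
NAME_RE = re.compile(
    r"^(?:[ \t]*([a-z \t]+\b)[ \t]*)?"   # group 1: forenames before the surname
    r"(?:/([a-z \t]+)/)?"                # group 2: surname, slashes excluded
    r"(?:[ \t]*([a-z \t]+\b)[ \t]*)?$",  # group 3: forenames after the surname
    re.IGNORECASE,
)


def split_name(line):
    """Return (forenames, surname), or None if the line does not match."""
    m = NAME_RE.match(line)
    if m is None:
        return None
    # Groups that did not take part in the match come back as None.
    before, surname, after = (g or "" for g in m.groups())
    forenames = " ".join(part for part in (before, after) if part)
    return forenames, surname


print(split_name("Fred Joseph /Smith/"))  # ('Fred Joseph', 'Smith')
print(split_name("Fred Joseph"))          # ('Fred Joseph', '')
```

Joining groups 1 and 3 in code is one way around the regex not being able to merge them into a single group. Names containing characters outside a-z, such as an apostrophe or hyphen, do not match this pattern and return None.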

Wow, thanks Niitaku. No wonder I couldn't figure this out. Appreciate the clear explanation too. Thanks. — Magic Bullet Dave, Feb 18 '17 at 10:24

score 0 · Answer 2 · answered Feb 18 '17 at 09:18

For your requirements

([A-z a-z /])+\w*

Sample

Sandeep Bhaskar

Thanks for the quick response Sandeep. Doesn't seem to work for me. Added a \ before the / but still didn't capture as expected. – Magic Bullet Dave Feb 18 '17 at 09:25

grail · Answer 3 · 2017-02-18T20:21:40.120

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':

(.*)(\/?.*\/?)(.*)

Note that this does not give you a group for EACH name, as some solutions will have multiple names in a single group.
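
To see what this pattern actually captures, here is a small check in Python (chosen only for illustration; the question itself uses NSRegularExpression, and the helper name groups_for is hypothetical). Because the first (.*) is greedy and everything after it is allowed to match nothing, the first group takes the whole line:

```python
import re

# The pattern from this answer; every part of it may match the empty string.
GREEDY = re.compile(r"(.*)(/?.*/?)(.*)")


def groups_for(line):
    """Return the three captured groups for a single line."""
    return GREEDY.match(line).groups()


print(groups_for("Fred Joseph /Smith/"))  # ('Fred Joseph /Smith/', '', '')
```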

Edit:

Extending Niitaku's solution and looking at having each individual name in its own group, you could use:

^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$

As explained though, if using a language like ruby it would simply be:

ruby -pe '$_ = $_.scan(/\w+/)' file

edited Feb 18 '17 at 20:21

answered Feb 18 '17 at 09:29


Thanks grail. When I do that, the first capture group captures everything, i.e. Fred Joseph /Smith/. I am using NSRegularExpression, but testing on https://regex101.com with the PCRE flavour. – Magic Bullet Dave Feb 18 '17 at 09:38
Something I am confused on, what is the desired output? Should we be capturing the '/' as part of a group or do you just want the names? – grail Feb 18 '17 at 09:47
Just the names. Ideally, 1st capture would be 'Fred Joseph' and the 2nd capture group 'Smith'. HTMS Dave – Magic Bullet Dave Feb 18 '17 at 09:51
You need to account for every scenario if you are using only a regex engine, but if you are using a language (like Ruby) I could deliver all the names easily – grail Feb 18 '17 at 10:19

score 0 · Answer 4 · edited Feb 18 '17 at 10:29

Hope this helps: (.*?)\/(.*?)\/(.*)
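
A short Python check of this pattern, as a sketch (the helper name lazy_groups is made up for illustration). The lazy quantifiers split the line at the slashes, but since both slashes are required, a line with no surname does not match at all, and the forenames keep their surrounding spaces:

```python
import re

# Same pattern; the slashes need no escaping in a Python raw string.
LAZY = re.compile(r"(.*?)/(.*?)/(.*)")


def lazy_groups(line):
    """Return the three groups, or None when the line has no /surname/."""
    m = LAZY.match(line)
    return m.groups() if m else None


print(lazy_groups("Fred Joseph /Smith/"))  # ('Fred Joseph ', 'Smith', '')
print(lazy_groups("Fred Joseph"))          # None
```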

edited Feb 18 '17 at 10:29

clinton3141


answered Feb 18 '17 at 09:57

user3507211


A great answer on StackOverflow includes more than just some code. You can improve your answer by explaining what's going on so that people can learn from it. – clinton3141 Feb 18 '17 at 10:28

Theo · Answer 5 · 2017-02-18T13:11:44.163

Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$

This matches the following:

- ^ start of string (or with multiline modifier start of line)
- ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
- (/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
- ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
- $ end of string (or with multiline modifier end of line)

You can see in action with your example text here: https://regex101.com/r/9kmKpy/1

To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes: ^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$

https://regex101.com/r/9kmKpy/2
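
As a rough sketch of how the non-capturing version could be used from code, here it is in Python (the asker is on NSRegularExpression, so treat this as an illustration only; parse_name is a hypothetical helper). The regex leaves leading and trailing spaces in groups 1 and 3, so this sketch assumes it is acceptable to trim and merge them afterwards:

```python
import re

# Non-capturing wrapper around the slashes, so group 2 is the bare surname.
NAME_RE = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$")


def parse_name(line):
    """Return (forenames, surname), or None for a malformed line."""
    m = NAME_RE.match(line)
    if m is None:
        return None
    before, surname, after = m.groups()
    # Groups 1 and 3 are always strings; split/join trims and merges them.
    forenames = " ".join((before + " " + after).split())
    # Group 2 is None when the optional /surname/ part is absent.
    return forenames, surname or ""


print(parse_name("Fred /Smith/ Joseph"))  # ('Fred Joseph', 'Smith')
print(parse_name("Fred Joseph"))          # ('Fred Joseph', '')
```

A line with an unbalanced slash, such as Fred /Smith, does not match and gives None.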
