2

I need a regular expression to extract names from a GEDCOM file. The format is:

Fred Joseph /Smith/

Here the text between the / characters is the surname and Fred Joseph are the forenames. The complication is that the surname could be anywhere in the text, or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.

This is as far as I have got. I have tried making groups optional with the ? quantifier, but to no avail:

What I have so far

As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.

Any help would be much appreciated.

Magic Bullet Dave

5 Answers

3

As for your last point, I'm not sure there is a way to join group 1 and group 3 into a single group.

Here is my proposed solution. It doesn't capture spaces around forenames.

^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$

To match the names correctly, be sure to use the case-insensitive flag and, if you test all lines at once, the multiline flag.

See the demo

Explanation

  • ^ start of the line
  • (?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
    • \h* 0 or more horizontal spaces
    • ([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
    • \h* matches the possible remaining spaces without capturing
  • (?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
  • (?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
  • $ end of the line
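
A minimal Python sketch of this pattern, offered as an illustration rather than as the answerer's own code. Python's re module has no \h, so [ \t] stands in for it here (an assumption about what should count as horizontal space), and the helper name split_name is hypothetical. Because the regex itself cannot merge groups 1 and 3, the sketch joins them in code:

```python
import re

# Same structure as the pattern above; [ \t] replaces PCRE's \h,
# which Python's re module does not support.
NAME_RE = re.compile(
    r"^(?:[ \t]*([a-z \t]+\b)[ \t]*)?"   # forenames before the surname
    r"(?:/([a-z \t]+)/)?"                # surname between slashes
    r"(?:[ \t]*([a-z \t]+\b)[ \t]*)?$",  # forenames after the surname
    re.IGNORECASE,
)

def split_name(line):
    """Return (forenames, surname), or None if the line does not match.

    Either element may be an empty string. (Hypothetical helper.)
    """
    m = NAME_RE.match(line)
    if m is None:
        return None
    before, surname, after = m.groups()
    # Groups 1 and 3 cannot be merged by the regex, so join them here.
    forenames = " ".join(part for part in (before, after) if part)
    return forenames, surname or ""

for line in ["Fred Joseph /Smith/", "/Smith/ Fred Joseph",
             "Fred /Smith/ Joseph", "Fred Joseph"]:
    print(split_name(line))
```

Like the original pattern, this only accepts ASCII letters and spaces, so names containing hyphens, apostrophes or accented characters would need a wider character class.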
Niitaku
0

For your requirements

([A-z a-z /])+\w*

Sample

Sandeep Bhaskar
0

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':

(.*)(\/?.*\/?)(.*)

Note that this does not give you a group for EACH name, since some matches will have multiple names in a single group.

Edit:

Extending Niitaku's solution and aiming to have each individual name in its own group, you could use:

^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$

As explained though, if you are using a language like Ruby, it would simply be:

ruby -pe '$_ = $_.scan(/\w+/)' file
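
For comparison, a rough Python equivalent of the scan above might look like the sketch below. The function name all_names is hypothetical, and the approach is the same: every run of word characters becomes one name.

```python
import re

def all_names(line):
    # \w+ matches each run of word characters, so the slashes and
    # spaces act as separators and are dropped.
    return re.findall(r"\w+", line)

print(all_names("Fred Joseph /Smith/"))  # ['Fred', 'Joseph', 'Smith']
```

Note that this returns every name but no longer records which one was the surname.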
grail
  • Thanks grail. When I do that the first capture group captures everything including, i.e., Fred Joseph /Smith/. I am using NSRegularExpression, but testing using https://regex101.com with the pcre flavour. – Magic Bullet Dave Feb 18 '17 at 09:38
  • Something I am confused on, what is the desired output? Should we be capturing the '/' as part of a group or do you just want the names? – grail Feb 18 '17 at 09:47
  • Just the names. Ideally, 1st capture would be 'Fred Joseph' and the 2nd capture group 'Smith'. HTMS Dave – Magic Bullet Dave Feb 18 '17 at 09:51
  • You need to account for all scenarios if you are simply using a regex engine, but if you are using a language (like Ruby) I could deliver all the names easily – grail Feb 18 '17 at 10:19
0

Hope this helps: (.*?)\/(.*?)\/(.*)

clinton3141
  • A great answer on StackOverflow includes more than just some code. You can improve your answer by explaining what's going on so that people can learn from it. – clinton3141 Feb 18 '17 at 10:28
0

Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$

This matches the following:

  • ^ start of string (or with multiline modifier start of line)
  • ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
  • (/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
  • ([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
  • $ end of string (or with multiline modifier end of line)

You can see in action with your example text here: https://regex101.com/r/9kmKpy/1

To avoid capturing the slashes, make the second set of brackets non-capturing by adding ?: to it, and then add another capturing pair between the slashes: ^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$

https://regex101.com/r/9kmKpy/2
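
As a sketch of how this last pattern could be used to get the two values the question asks for, here is a short Python example that runs it over several lines at once with the multiline flag. The function name extract_names and the sample text are illustrative assumptions. The forenames on either side of the surname are joined in code and the stray spaces collapsed:

```python
import re

# The final pattern above: group 2 is the surname without its slashes.
PATTERN = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$", re.MULTILINE)

def extract_names(text):
    """Yield (forenames, surname) for each non-empty matching line."""
    for m in PATTERN.finditer(text):
        before, surname, after = m.groups()
        # Join the two forename pieces and collapse the stray spaces.
        forenames = " ".join((before + " " + after).split())
        if forenames or surname:  # skip blank lines
            yield forenames, surname or ""

sample = (
    "Fred Joseph /Smith/\n"
    "/Smith/ Fred Joseph\n"
    "Fred /Smith/ Joseph\n"
    "Fred Joseph"
)
for forenames, surname in extract_names(sample):
    print(repr(forenames), repr(surname))
```

Each of the first three sample lines yields 'Fred Joseph' and 'Smith'; the last yields 'Fred Joseph' with an empty surname.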

Theo