
I am trying to parse a GEDCOM file using regular expressions and am almost there, but the expression grabs the next line of the text for lines where there is optional text at the end of line. Each record should be a single line.

This is an extract from the file:

0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
2 VERS (2010.3)
2 NAME Ancestry.com Family Trees
2 CORP Ancestry.com
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
0 @P6@ INDI
1 BIRT

And this is the regular expression I am using:

(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)

This works for all lines except those that do not contain any text at the end, such as the first one. For instance, the last capture group for the first record contains the '1 CHAR UTF-8'.

Here's a screenshot from regex101.com, showing how the purple capture group bleeds onto the next line:

[Image: regex101 screenshot]

I have tried using the $ anchor to make the .* stop at the end of the line, but this fails because the end of the second line is also a line end.

Peter Mortensen
Magic Bullet Dave
  • `\s` matches newlines, try replacing it with a regular space, or `[^\S\r\n]` (or `\h` if it is PCRE). See https://regex101.com/r/N2ZWWo/1 (a `^` is added with the multiline option, too). – Wiktor Stribiżew Feb 13 '17 at 11:10
  • Great thanks Wiktor, if you want to create an answer I will mark as best. This seems to do the trick: (\d+) +(@\S+@)? *(\S+) *(.*) – Magic Bullet Dave Feb 13 '17 at 11:14
  • `.*` is greedy by default and will match as much as it can. Try `.*?$` to make it a non-greedy match. – phuzi Feb 13 '17 at 11:21

1 Answer


The \s pattern matches newline characters, so the \s+ after the tag can consume the line break and let (.*) capture the following line. Replace it with a regular space, or [^\S\r\n], or \h if it is PCRE, or [\p{Zs}\t].

(\d+) +(@\S+@)? *(\S+) *(.*)

See the regex demo
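
As a minimal sketch of the difference, assuming Python's `re` module (the question does not name a regex engine), the snippet below runs the original `\s`-based pattern and a literal-space version against the first two sample lines. The names `SAMPLE`, `ORIGINAL`, and `SPACES_ONLY` are illustrative only.

```python
import re

# First two records from the sample file in the question.
SAMPLE = "0 HEAD\n1 CHAR UTF-8\n"

# Original pattern: \s can consume the newline after HEAD.
ORIGINAL = re.compile(r"(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)")

# Literal spaces only. The gap before the trailing text is optional (" *")
# so that records with no trailing text still match.
SPACES_ONLY = re.compile(r"(\d+) +(@\S+@)? *(\S+) *(.*)")

bleeding = ORIGINAL.match(SAMPLE).groups()
fixed = SPACES_ONLY.match(SAMPLE).groups()

print(bleeding)  # ('0', None, 'HEAD', '1 CHAR UTF-8')
print(fixed)     # ('0', None, 'HEAD', '')
```

The optional group `(@\S+@)?` comes back as `None` when a record has no cross-reference ID, so code consuming the groups should allow for that.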

If you need to match whole lines, enable the multiline option and add anchors: ^ at the start and $ at the end of the pattern (see another demo).
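
The anchored, multiline variant could look roughly like this in Python (again an assumption about the engine). Python's `re` has no `\h` or `\p{Zs}`, so the sketch uses `[^\S\r\n]` for "whitespace other than a line break". `GEDCOM`, `HSPACE`, `LINE`, and `records` are hypothetical names.

```python
import re

# A few records taken from the extract in the question.
GEDCOM = """0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
2 VERS (2010.3)
0 @P6@ INDI
1 BIRT
"""

# Whitespace except CR/LF, so separators can never cross a line break.
HSPACE = r"[^\S\r\n]"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
LINE = re.compile(
    rf"^(\d+){HSPACE}+(@\S+@)?{HSPACE}*(\S+){HSPACE}*(.*)$",
    re.MULTILINE,
)

# One tuple per record: (level, optional xref id, tag, trailing text).
records = [m.groups() for m in LINE.finditer(GEDCOM)]

for record in records:
    print(record)
```

Each input line yields exactly one tuple, and records without trailing text get an empty string in the last position instead of the next line's content.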

Wiktor Stribiżew