I have an input consisting of lines of text, some of which contain relevant geodata of objects. I want to identify the relevant lines by matching a given prefix (which identifies the coordinates that follow as belonging to the desired geo-object).

It may look similar to the following:

/line with irrelevant prefix
/line with irrelevant prefix, potentially also containing coordinates
[relevant prefix][bunch of characters][1 to X coordinates in the form "lat: X.XXXXXX, lon:X.XXXXXX"][bunch of other characters][other potentially relevant information]
/line with irrelevant prefix
/line with irrelevant prefix

The actual data I want to extract is a sequence of coordinates (a LineString), from which I want to generate an object representing the LineString for further use in my C# code. Additional attributes (such as a name or an ID) might be relevant, too. However, I also want to reject lines that contain coordinates but do not start with my relevant prefix.

From what I understand, I can use named capturing groups in Regexes to get substrings as "variables" from the relevant lines, like this (please do not mind the imprecise format for the coordinate):

(lat=(?<lat>\d{0,2}[.,]\d+)), (lon=(?<lon>\d{0,2}[.,]\d+))

However, as far as I can see, I cannot have my expression match an arbitrary number of coordinates in the line (since I do not know the length of the LineString in advance) and at the same time have it match the prefix pattern.

Is there a solution to have the expression match the prefix, have named capturing groups for the arbitrarily many pairs of coordinates, and also have additional named capturing groups for potentially relevant additional variables?

Suppose I have the following line:

PREFIXafkzh(lat=34.42344, lon=23.6346jsdfkh,lat=2.4234, lon=12.124)

I have tried the following regex:

(lat=(?<lat>\d{0,2}[.,]\d+)), (lon=(?<lon>\d{0,2}[.,]\d+))

This matches all coordinates, but not the prefix.

(PREFIX).*(lat=(?<lat>\d{0,2}[.,]\d+)), (lon=(?<lon>\d{0,2}[.,]\d+))
(PREFIX).*((lat=(?<lat>\d{0,2}[.,]\d+)), (lon=(?<lon>\d{0,2}[.,]\d+)))+

Both of these will only match the last coordinate in the line.

halfer
Jannik Michel

1 Answer

It is hard to tell from the question what kind of prefixes you have, and whether there is one prefix for the whole line and another one just for a group of coordinates. You may want to add a few more examples of input data.

My understanding is that each relevant line starts with a prefix followed by a "(", and then contains coordinates, each of which may itself be preceded by a word and a comma.

Assuming that, I came up with this .NET regex:

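// Requires: using System.Text.RegularExpressions;
// The "x" flag on regex101 corresponds to RegexOptions.IgnorePatternWhitespace in .NET.
var regex = new Regex(@"
(?<line_prefix>\w+)\(            # Line word prefix followed by (.
(?:                              # Non-capturing group to match multiple coords.
  (?<coord_prefix>[\w,]*)        # Optional coord prefix (may have to adapt).
  (?:lat=(?<lat>\d{0,2}[.,]\d+)) # Latitude value.
  ,\s*                           # Comma and optional spaces.
  (?:lon=(?<lon>\d{0,2}[.,]\d+)) # Longitude value.
)+", RegexOptions.IgnorePatternWhitespace);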

You can test it here: https://regex101.com/r/vxjeww/3

You may have to adapt the line prefix and coordinate prefix patterns, since it is not clear from the question what they can contain.

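Because the non-capturing group can match one or more times, the named groups inside it capture several values. In .NET, every named group keeps all of these values in its Captures collection, so match.Groups["lat"].Captures and match.Groups["lon"].Captures hold every coordinate in order, whereas match.Groups["lat"].Value returns only the last one.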
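
To make the use of the Captures collections concrete, here is a minimal sketch of how the matched coordinates could be turned into a list of points in C#. The class name LineStringParser, the method Parse, and the tuple-based point type are hypothetical names chosen for illustration. The pattern is the one above with two assumptions added: the line prefix is pinned to the literal PREFIX from the sample line, and it is anchored to the start of the line so that lines with other prefixes are rejected.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LineStringParser
{
    // Assumption: relevant lines start with the literal "PREFIX" (taken from
    // the sample line in the question). Adapt this part to the real prefix.
    private static readonly Regex LinePattern = new Regex(@"
        ^(?<line_prefix>PREFIX\w*)\(     # Line prefix followed by (.
        (?:                              # Repeated once per coordinate pair.
          (?<coord_prefix>[\w,]*)        # Optional junk before a pair.
          lat=(?<lat>\d{0,2}[.,]\d+)     # Latitude value.
          ,\s*                           # Comma and optional spaces.
          lon=(?<lon>\d{0,2}[.,]\d+)     # Longitude value.
        )+",
        RegexOptions.IgnorePatternWhitespace);

    // Returns all (lat, lon) pairs of a relevant line, or an empty list
    // if the line does not match the pattern.
    public static List<(double Lat, double Lon)> Parse(string line)
    {
        var points = new List<(double Lat, double Lon)>();
        Match match = LinePattern.Match(line);
        if (!match.Success)
            return points;

        // Each iteration of the repeated group adds one capture to both
        // collections, so they have the same length and the same order.
        CaptureCollection lats = match.Groups["lat"].Captures;
        CaptureCollection lons = match.Groups["lon"].Captures;
        for (int i = 0; i < lats.Count; i++)
        {
            points.Add((ToDouble(lats[i].Value), ToDouble(lons[i].Value)));
        }
        return points;
    }

    // The pattern accepts both '.' and ',' as decimal separator, so the
    // text is normalized before parsing with the invariant culture.
    private static double ToDouble(string text) =>
        double.Parse(text.Replace(',', '.'), CultureInfo.InvariantCulture);
}
```

Calling LineStringParser.Parse on the sample line from the question should yield two points, and a line that does not start with the prefix yields an empty list. Additional attributes can be read the same way, for example from match.Groups["line_prefix"].Value, or from further named groups appended after the repeated group.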

Patrick Janser