I have coordinates in various formats and am trying to write a more or less universal conversion routine.

For this I parse each string with a regular expression and try to pick out the degrees, minutes and seconds by their order of appearance in the string.

For some inputs it works, but not for all. I am fairly sure the problem comes from my limited understanding of regex.

So the question is: can someone with a better understanding of regex patterns help?

I compiled a short piece of code to demonstrate the problem. Running the example below shows that I get three components for the first four and the last three coordinates. The ones in between deliver just two components.

coords = c("-53°30''30.54'",
       "s55°30' 30.54",
       "55°30'30.54n",
       "0°1 0.5S",
       "-0°30'30''s",
       "S55 30 30",
       "-55°30'30''",
       "-55° 30' 30''",
       "-55°   30'   30",
       "-55 sometimes with text rests 30 30''",
       "55°30'30,54S",
       "S55° 30' 30,54",
       "-55° 30' 30.54''"
       )

for (i in 1:length (coords)) {
    pattern   <- gregexpr ("[0-9.]+", coords [i])
    print (as.character (unique (unlist (regmatches (coords [i], pattern)))))
}


<Output>
[1] "53"    "30"    "30.54"
[1] "55"    "30"    "30.54"
[1] "55"    "30"    "30.54"
[1] "0"   "1"   "0.5"
[1] "0"  "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30" "54"
[1] "55" "30" "54"
[1] "55"    "30"    "30.54"

The regex in the answer below is a pretty impressive monster ;-) Nevertheless, it has some problems when the coordinates are in a slightly different format (e.g. decimal degrees). In this case the first or the second number of the string is not correctly identified. I just compiled a list with such coordinates:

coords = c("-53°30''30.54'", "s55°30' 30.54", "55°30'30.54n", "0°1 0.5S",
       "-0°30'30''s", "S55 30 30", "-55°30'30''", "-55° 30' 30''",
       "-55° 30' 30", "-55 sometimes with text rests 30 30''",
       "55°30'30,54S", "S55° 30' 30,54", "-55° 30' 30.54''",
       "-55.5432 30 30.54", "-55.30.30", "55.555", "55,555S", "S55,555",
       "S55.555", "55,555°S", "55.555°", "-55,555", "-55.555"
       )
Jan Schulz

  • What format is your goal exactly? You can likely bypass doing the regex yourself and use something like the `measurements` package's unit conversion, but that depends on your goal after you've parsed the pieces of the coordinates – camille Oct 06 '20 at 15:19
  • No, I know the measurements package and I am not happy with it. The code snippet above is not the full story. I will need a more versatile function where I can apply some correction terms. Thus, I need the first/second/third value separated. I once did it with complex parsing, but there must be an option to do it with regex... – Jan Schulz Oct 06 '20 at 15:24
  • And you don't need to keep the negative signs or directions ("S", etc)? That's why it would be helpful to see what you need to be able to do with your output – camille Oct 06 '20 at 15:29
  • I am re-parsing the string later on for the minus or S or W to identify negative values. One of the major problems is that the variety of input data formats is broad and not regulated. – Jan Schulz Oct 06 '20 at 15:41

2 Answers

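We can try using regexec along with regmatches to match exactly three numbers in each string. A "number" here is defined as either an integer or an integer with a decimal component (the decimal separator being either a dot or a comma), optionally preceded by a minus sign.

regmatches returns a list of character vectors, each holding the full match followed by the three captured groups. We can keep elements 2 to 4 of each vector and bind them into a matrix using do.call with rbind.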

regex <- "^.*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*$"
do.call(rbind, lapply(regmatches(coords, regexec(regex, coords)), function(x) x[2:4]))

      [,1]  [,2] [,3]   
 [1,] "-53" "30" "30.54"
 [2,] "55"  "30" "30.54"
 [3,] "55"  "30" "30.54"
 [4,] "0"   "1"  "0.5"  
 [5,] "-0"  "30" "30"   
 [6,] "55"  "30" "30"   
 [7,] "-55" "30" "30"   
 [8,] "-55" "30" "30"   
 [9,] "-55" "30" "30"   
[10,] "-55" "30" "30"   
[11,] "55"  "30" "30,54"
[12,] "55"  "30" "30,54"
[13,] "-55" "30" "30.54"
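
As a possible follow-up, here is a minimal sketch of how the character matrix could be turned into signed decimal degrees. The names `parts`, `nums`, `negative` and `dec_deg` are made up for illustration, and the sign rule (a leading minus or S/W, or a trailing S/W) is only an assumption based on the comments under the question, not a general standard. The sign is taken from the original string because "-0" loses its sign once it is converted to a number.

```r
# A few sample inputs taken from the question
coords <- c("-53°30''30.54'", "s55°30' 30.54", "55°30'30,54S", "-0°30'30''s")

# Same pattern as above: three numbers, each optionally with a decimal part
regex <- "^.*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*$"
parts <- do.call(rbind, lapply(regmatches(coords, regexec(regex, coords)),
                               function(x) x[2:4]))

# Replace decimal commas with dots, then convert column by column to numeric
nums <- apply(parts, 2, function(col) as.numeric(sub(",", ".", col, fixed = TRUE)))

# Assumed sign rule (hypothetical): leading "-", "S" or "W", or trailing "S" or "W"
negative <- grepl("^[[:space:]]*[-sSwW]|[sSwW][[:space:]]*$", coords)

# degrees + minutes/60 + seconds/3600, with the sign applied at the end
dec_deg <- (abs(nums[, 1]) + nums[, 2] / 60 + nums[, 3] / 3600) *
  ifelse(negative, -1, 1)
print(dec_deg)
```

Note that `apply` only returns a matrix here because there is more than one input string; for a single coordinate the result would need to be reshaped first.
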
Tim Biegeleisen

It seems to work OK with stringr, once decimal commas are replaced by dots:

library(stringr)
str_extract_all(str_replace_all(coords, ",", "."), "[0-9.\\-]+")

[[1]]
[1] "-53"   "30"    "30.54"

[[2]]
[1] "55"    "30"    "30.54"

[[3]]
[1] "55"    "30"    "30.54"

[[4]]
[1] "0"   "1"   "0.5"

[[5]]
[1] "-0" "30" "30"

[[6]]
[1] "55" "30" "30"

[[7]]
[1] "-55" "30"  "30" 

[[8]]
[1] "-55" "30"  "30" 

[[9]]
[1] "-55" "30"  "30" 

[[10]]
[1] "-55" "30"  "30" 

[[11]]
[1] "55"    "30"    "30.54"

[[12]]
[1] "55"    "30"    "30.54"

[[13]]
[1] "-55"   "30"    "30.54"
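
For comparison, a base R sketch of the same idea, with no extra package. It suggests that the two-component results in the question come from the `unique()` call, which drops the second "30" whenever minutes and seconds are equal, rather than from the pattern itself. The variable names `clean` and `parts` are only illustrative, and the character class here leaves out the minus sign, as in the question's code.

```r
# A few inputs from the question that returned only two components
coords <- c("-55°30'30''", "55°30'30,54S", "S55 30 30")

# Normalise decimal commas to dots, as in the stringr call above
clean <- gsub(",", ".", coords, fixed = TRUE)

# Same character class as in the question, but without unique(),
# so repeated values such as "30" "30" are kept
parts <- regmatches(clean, gregexpr("[0-9.]+", clean))
print(parts)
```
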
Andrew Gustar