1

I'am trying to extract sets of coordinates from strings and change the format.

I have tried some of the stringr package and getting nowhere with the pattern extraction. It's my first time dealing with regex and still is a little confusing to create a pattern.

There is a data frame with one column with one or more sets of coordinates. The only pattern (the majority) separating Lat from Long is (-), and to separate one set of coordinates to another there is a (/)

Here is an example of some of the data:

ID  Coordinates
1   3438-5150
2   3346-5108/3352-5120 East island, South port
3   West coast (284312 472254)
4   28.39.97-47.05.62/29.09.13-47.44.03
5   2843-4722/3359-5122(1H-2H-3H-4F)

Most of the data is in decimal degree, e.g. (id 1 is Lat 34.38 Lon 51.50), some others is in 00º00'00'', e.g. (id 4 is Lat 28º 39' 97'' Lon 47º 05' 62'')

I will need to make in a few steps

1 - Extract all coordinates sets creating a new row for each set of each record;

2 - Extract the text label of record to a new column, concatenating them;

3- Convert the coordinates from 00º00'00''(28.39.97) to 00.0000º (28.6769 - decimal dregree) so all coordinates are in the same format. I can easily convert if they are as numeric.

4 - Add dot (.) to separate the decimal degree values (from 3438 to 34.38) and add (-) to identify as (-34.38) south west hemisphere. All value must have (-) sign.

I'am trying to get something like this:

Step 1 and 2 - Extract coordinates sets and names

ID  x           y          label
1   3438        5150      
2   3346        5108      East island, South port
2   3352        5120      East island, South port
3   284312      472254    West coast
4   28.39.97    47.05.62    
4   29.09.13    47.44.03
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 3 - convert coordinates format to decimal degree (ID 4)

ID  x           y       label
1   3438        5150    
2   3346        5108    East island, South port
2   3352        5120    East island, South port
3   284312      472254  West coast
4   286769      471005  
4   291536      470675
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 4 - change display format

ID   x          y         label
1   -34.38      -51.50    
2   -33.46      -51.08    East island, South port
2   -33.52      -51.20    East island, South port
3   -28.43      -47.22    West coast
4   -28.6769    -47.1005    
4   -29.1536    -47.0675
5   -28.43      -47.22    1H-2H-3H-4F
5   -33.59      -51.22    1H-2H-3H-4F   

I have edit the question to better clarify my problems and change some of my needs. I realized that it was messy to understand.

So, has anyone worked with something similar? Any other suggestion would be of great help.

Thank you again for the time to help.

aoceano
  • 85
  • 1
  • 13
  • 1
    The regex to extract all your coordinates is [`(\d{2})\.?(\d{2})(?:\.(\d{2}))?-(\d{2})\.?(\d{2})(?:\.(\d{2}))?`](https://regex101.com/r/vY4iE5/1). You need to search for it globally (`g` modifier). Then if `$3` and `$6` are non empty, you've got minutes and seconds so convert it to decimals of degree (in your R code). Unfortunately I'm not of much help when it comes to R so maybe someone else will pick up the regex and prepare an R example. – Dmitry Egorov Aug 18 '16 at 04:00
  • Can you clarify your second question ? That use case is not clear. Can you add input data for your second example and the corresponding output. – steveb Aug 18 '16 at 04:28
  • Thanks a lot @DmitryEgorov it will help a lot to understand better regex pattern. – aoceano Aug 18 '16 at 12:08
  • @steveb my second example would be another way of extract the coordinates. Instead of creating new columns for each new set of the same record, I would create new rows with the same ID. A few records have up to 5 sets of coordinates in the same string. It is more of a question regarding of if would be a good solution. – aoceano Aug 18 '16 at 12:10
  • @steveb I changed the question to better clarify my situation and improved the example. – aoceano Aug 18 '16 at 14:26
  • @aoceano The post no longer appears to have the original data, also this means some of the sample data format no longer appears to be there. The missing data I am referring to is `2843-4722/3359-5122(1H-2H-3H-4F)`. It is a good idea to include all use cases you need to address and the expected output for all of the use cases. That will make it easier to address all situations. – steveb Aug 18 '16 at 14:33
  • @steveb Thanks. I forgot to adjust the step 4. The 2843-4722/3359-5122(1H-2H-3H-4F) was removed since it is as 3346-5108/3352-5120 East island, South port. The (1H-2H-3H-4F) is just like a name. I added the West coast (284312 472254) since the coordinates is inside () and don't have a (-) to separate them. – aoceano Aug 18 '16 at 14:54
  • @aoceano How should one determine if the numeric values should be negative ? Just to clarify, will data of the form `2843-4722/3359-5122(1H-2H-3H-4F)` still be in your data set ? – steveb Aug 18 '16 at 15:07
  • @steveb all values are negative. All data is from the South west hemisphere. So people tend not to use the (-). The data 2843-4722/3359-5122(1H-2H-3H-4F) form is still in the data. Since I consider (1H-2H-3H-4F) as a name I exchanged to another data to try to reduce the example. – aoceano Aug 18 '16 at 16:23
  • For completeness, please add that use case to the initial question. If there are others that are not there you should add those too. – steveb Aug 18 '16 at 16:43
  • @steveb all done. all cases found in the dataset. Can't thank enough for taking the time to help me. – aoceano Aug 18 '16 at 17:10
  • @aoceano FYI, I just noticed you now have two output columns and not four. I will post the solution for four (as originally asked) but you can remove the unwanted columns. – steveb Aug 18 '16 at 18:15

2 Answers2

3

Note: the first answers address the original asking of the question and the last answer addresses its current state. The data in data1 should be set appropriately for each solution.

The following should address your first question given the data you provided and the expected output (using dplyr and tidyr).

library(dplyr)
library(tidyr)

### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120", 
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA, 
-4L))

### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }

### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
    mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
##   ID   x.1   y.1   x.2   y.2
## 1  1 34.38 51.50    NA    NA
## 2  2 33.46 51.08 33.52 51.20
## 3  3 28.43 47.22 33.59 51.22
## 4  4 28.39 47.05 29.09 47.44

The call to mutate modifies Coordinates twice to make substitutions easier.

Edit

A variation that uses another regex substitution instead of mutate_at.

data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
       Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
       Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)

Edit 2: The following solution addresses the updated version of the question

The following solution does a number of transformations to transform the data. These are separate to make it a bit easier to think about (much easier relatively speaking).

library(dplyr)
library(tidyr)

data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port", 
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03", 
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))

### Function for converting to numeric values and
### handles case of "12.34.56" (hours/min/sec)
hms_convert <- function(llval) {
  nres <- rep(0, length(llval))
  coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
  nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3), function(x) { sum(as.numeric(x) / c(1,60,3600))})
  nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
  nres
}

### Each mutate works to transform the various data formats
### into a single format.  The 'separate' commands then split
### the data into the appropriate columns.  The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
  mutate(Coordinates_new = Coordinates) %>%
  mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
  separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
  mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
  separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
  mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
  mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))

##   ID                                 Coordinates       x.1       y.1       x.2       y.2                   label
## 1  1                                   3438-5150 -34.38000 -51.50000        NA        NA                    <NA>
## 2  2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3  3                  East coast (284312 472254) -28.72000 -47.38167        NA        NA             East coast 
## 4  4         28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417                    <NA>
## 5  5            2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000                    <NA>
steveb
  • 5,382
  • 2
  • 27
  • 36
  • Thank you very much for the help. I really liked the solution. I just think I will need to make in two steps. Since I have some coordinates as 28.39.97-47.05.62 (Lat 28º 39' 97'' Lon 47º 05' 62'') I will need to first address this issue. I will need to extract this data and convert to decimal degree as the most of the values ( 28º 39' 97'' -> is 28.6769 and not 28.39). Than I can safely extract the coordinates all in the same format. Messy huh? – aoceano Aug 18 '16 at 13:51
  • @aoceano If you decide to change the input again, please pose a new question. This kind of data cleaning can be time consuming to work with. – steveb Aug 18 '16 at 18:29
  • 1
    Thank you very much, can't express myself. The solution has everything that I need and the way you putted I can easily understand and think about. Already made lot's of notes about the step-by-step process and will be using some of them for several others issues. I'll be sure to pay more attention to the question so I don't have to edit the problem to make better for everyone. Thank you one more time for taking your time to help people like me and so many others. Best regards. – aoceano Aug 18 '16 at 19:07
3

We can use stringi. We create a . between the 4 digit numbers with gsub, use stri_extract_all (from stringi) to extract two digit numbers followed by a dot followed by two digit numbers (\\d{2}\\.\\d{2}) to get a list output. As the list elements have unequal length, we can pad NA at the end for those elements that have shorter length than the maximum length and convert to matrix (using stri_list2matrix). After converting to data.frame, changing the character columns to numeric, and cbind with the 'ID' column of the original dataset.

library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})", 
  "\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <-  paste0(c("x.", "y."), rep(1:2,each = 2))

cbind(data1[1], d1)
#  ID   x.1   y.1   x.2   y.2
#1  1 34.38 51.50    NA    NA
#2  2 33.46 51.08 33.52 51.20
#3  3 28.43 47.22 33.59 51.22
#4  4 28.39 47.05 29.09 47.44

But, this can also be done with base R.

#Create the dots for the 4-digit numbers
str1 <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)
#extract the numbers in a list with gregexpr/regmatches
lst <- regmatches(str1, gregexpr("\\d{2}\\.\\d{2}", str1))
#convert to numeric
lst <- lapply(lst, as.numeric)
#pad with NA's at the end and convert to data.frame
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
#change the column names
colnames(d1) <-  paste0(c("x.", "y."), rep(1:2,each = 2))
#cbind with the first column of 'data1'
cbind(data1[1], d1)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you very much for the help. I understood what you did. I think my question wasn't very clear regarding the issues. I changed the question and added better example. I tried you code and still trying to make it work in my new scenario. Any thoughts would be more than welcome. Regards. – aoceano Aug 18 '16 at 14:28
  • 1
    @aoceano I am sorry that the solution was based on your initial input. If you say that that was not your input, it is kind of wasting the time. – akrun Aug 18 '16 at 15:26
  • 1
    I'am deeply sorry for wasting your time. I really appreciate what people do here and take their time to help others. It was my mistake not to clarify better and provide better example of my problem. Nevertheless I learned a lot from your answer and I will used in other cases in my work and It will help improve my R knowledge. Thank you one more time and again, really sorry. – aoceano Aug 18 '16 at 16:27