How to split a string on first number only

Question

I have a dataset of street addresses that are formatted very inconsistently. For example:

d <- c("street1234", "Street 423", "Long Street 12-14", "Road 18A", "Road 12 - 15", "Road 1/2")

From this I want to create two columns: 1. X, with the street name, and 2. Y, with the number plus everything that follows. Like this:

X           Y
Street      1234
Street      423
Long Street 12-14
Road        18A
Road        12 - 15
Road        1/2

So far I have tried strsplit and followed some similar questions here, for example: strsplit(d, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE). I just can't seem to find the correct regular expression.

Any help is highly appreciated. Thank you in advance!

Wiktor Stribiżew · Accepted Answer · 2017-02-09T10:38:29.943

10

There may be whitespace between the letter and the digit, so add \s* (zero or more whitespace characters) between the lookarounds:

> strsplit(d, split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE)
[[1]]
[1] "street" "1234"  

[[2]]
[1] "Street" "423"   

[[3]]
[1] "Long Street" "12-14"      

[[4]]
[1] "Road" "18A" 

[[5]]
[1] "Road"    "12 - 15"

[[6]]
[1] "Road" "1/2"
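
To see why the \s* matters, the following sketch compares the pattern from the question with the adjusted one by counting the pieces each produces. It uses only base R and the vector d from the question; the names original, fixed, n_original and n_fixed are introduced here purely for illustration.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Pattern from the question: needs a letter immediately followed by a digit.
original <- strsplit(d, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)

# Adjusted pattern: also consumes any whitespace between the letter and the digit.
fixed <- strsplit(d, split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE)

# Number of pieces produced for each input string.
n_original <- lengths(original)
n_fixed <- lengths(fixed)
```

Only "street1234" has a letter directly adjacent to a digit, so n_original should be 2 for that element and 1 for the others, while n_fixed should be 2 throughout.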

If you want to create columns based on that, you can use separate from the tidyr package:

> library(tidyr)
> separate(data.frame(A = d), col = "A" , into = c("X", "Y"), sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
            X       Y
1      street    1234
2      Street     423
3 Long Street   12-14
4        Road     18A
5        Road 12 - 15
6        Road     1/2

edited Feb 09 '17 at 10:38

answered Feb 09 '17 at 10:21

Wiktor Stribiżew


`do.call('rbind', strsplit(d, split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE))` – Sathish Feb 09 '17 at 10:35
1

@Sathish: Yes, but let's leave something for OP to do. There is no data frame generation code in the question itself; it is all about the regex. – Wiktor Stribiżew Feb 09 '17 at 10:37
1

Thanks for all the help. In the end I used colsplit with the provided regex and then bound the result to the existing dataset. The solution from Sathish is much more elegant, thanks. – Jesse Feb 13 '17 at 11:01

Sotos · Answer 2 · 2017-02-13T13:05:44.037

An approach that uses str_locate from stringr to find the position of the first digit in each string and then splits at that position:

library(stringr)

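# start position of the first run of digits in each string
ind <- str_locate(d, '[0-9]+')[,1]
# cut each string just before that position, trim the pieces, and keep the first two columns
setNames(data.frame(do.call(rbind, Map(function(x, y) 
          trimws(substring(x, seq(1, nchar(x), y-1), seq(y-1, nchar(x), nchar(x)-y+1))), 
                                                             d, ind))[,1:2]), c('X', 'Y'))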

#            X       Y
#1      street    1234
#2      Street     423
#3 Long Street   12-14
#4        Road     18A
#5        Road 12 - 15
#6        Road     1/2

Note that this produces a (harmless) warning: the string "Road 12 - 15" is split into three pieces, [1] "Road" "12 - 15" "", so rbind receives vectors of unequal length.
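
The same locate-then-cut idea can be written without the seq() arithmetic, which also avoids the warning. This is a minimal base R sketch, assuming each element should be cut exactly once at its first digit. The helper name split_at_first_digit is hypothetical, and returning NA in Y for strings that contain no digit is an assumption, not something the question specifies.

```r
# Hypothetical helper: cut each string once, at its first digit.
split_at_first_digit <- function(x) {
  # Position of the first digit, or -1 if there is none.
  # as.integer() drops the match attributes that regexpr attaches.
  pos <- as.integer(regexpr("[0-9]", x))
  has_digit <- pos > 0
  # Text before the first digit, with surrounding whitespace removed.
  X <- ifelse(has_digit, trimws(substr(x, 1, pos - 1)), x)
  # First digit and everything after it (assumption: NA when no digit exists).
  Y <- ifelse(has_digit, substr(x, pos, nchar(x)), NA_character_)
  data.frame(X = X, Y = Y, stringsAsFactors = FALSE)
}

d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")
res <- split_at_first_digit(d)
```

Because the cut happens only at the first digit, a value such as "Road 18A 5" (not part of the question's data) would stay in two pieces, "Road" and "18A 5".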

score 3 · Answer 3 · answered Feb 09 '17 at 12:08

This will also work:

do.call(rbind,strsplit(sub('([[:alpha:]]+)\\s*([[:digit:]]+)', '\\1$\\2', d), split='\\$'))
#     [,1]          [,2]     
#[1,] "street"      "1234"   
#[2,] "Street"      "423"    
#[3,] "Long Street" "12-14"  
#[4,] "Road"        "18A"    
#[5,] "Road"        "12 - 15"
#[6,] "Road"        "1/2"
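
The sub() call above uses "$" as a temporary delimiter, so it relies on "$" never occurring in the data. If that cannot be guaranteed, one possible alternative is to extract the two capture groups directly with regexec and regmatches from base R. This is a sketch under the assumption that every string contains a letter followed, optionally after whitespace, by a digit; the pattern and the names pattern, m and parts are illustrative and not part of the original answer.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Group 1: text before the first digit, ending in a letter.
# Group 2: the first digit and everything after it.
pattern <- "^([^0-9]*[[:alpha:]])\\s*([0-9].*)$"

# Each list element is c(full match, group 1, group 2).
m <- regmatches(d, regexec(pattern, d))

# Bind the matches into a matrix and keep only the two groups.
parts <- do.call(rbind, m)[, 2:3, drop = FALSE]
colnames(parts) <- c("X", "Y")
```

Strings that do not match the pattern yield an empty element in m, so they would need separate handling before the rbind step.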

answered Feb 09 '17 at 12:08

Sandipan Dey


2

Thanks for pointing out the [[:alpha:]] and [[:digit:]] solution. It makes the pattern more readable. – Jesse Feb 13 '17 at 13:00

score 2 · Answer 4 · answered Feb 09 '17 at 17:02

We can use read.csv with sub from base R:

read.csv(text=sub("^([A-Za-z ]+)\\s*([0-9]+.*)", "\\1,\\2", d), 
        header=FALSE, col.names = c("X", "Y"), stringsAsFactors=FALSE)
#             X       Y
#1       street    1234
#2      Street      423
#3 Long Street    12-14
#4        Road      18A
#5        Road  12 - 15
#6        Road      1/2
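
Because [A-Za-z ]+ is greedy, the first capture group keeps the space before the number, so X ends up with trailing whitespace (visible in the uneven alignment of the output above). One way to drop it, sketched here as an assumption rather than as part of the original answer, is to pass strip.white = TRUE, which read.csv forwards to read.table. The approach as a whole also assumes that the addresses contain no commas.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Insert a comma before the first number, then parse the result as CSV text.
# strip.white = TRUE removes leading and trailing whitespace from each field.
res <- read.csv(text = sub("^([A-Za-z ]+)\\s*([0-9]+.*)", "\\1,\\2", d),
                header = FALSE, col.names = c("X", "Y"),
                stringsAsFactors = FALSE, strip.white = TRUE)
```

Whitespace inside a field, as in "12 - 15", is left untouched; only the ends of each field are trimmed.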

answered Feb 09 '17 at 17:02

akrun


1

interesting solution! – Jesse Feb 13 '17 at 13:00
