3

I want to parse apart (extract) addresses into HouseNumber and Streetname. I should later be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).

So lets say I have a data frame called "shops":

> shops
      Name                 city        street
 1    Something            Fakecity    New Street 3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet 1-3
 3    SomethingDifferent   Fakecity    Fake Street 14a

So is there a way to split the street column into two lists one with the streetnames and one for the house numbers including cases like "1-3","14a", so that in the end, the result could be assigned to the data frame and look like.

 > shops
      Name                 city        Streetname                    HouseNumber
 1    Something            Fakecity    New Street                    3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet   1-3
 3    SomethingDifferent   Fakecity    Fake Street                   14a 

Example: Easyfakestreet 5 --> Easyfakestreet , 5

It gets slightly complicated by the fact that some of my street strings will have hyphenated street addresses and have non numerical components.

Examples:
New Street 3 --> ['New Street', '3 ']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet','1-3']
Fake Street 14a --> ['Fake Street', '14a']

I would appreciate some help!

Deset
  • 877
  • 13
  • 19

5 Answers5

8

Here's a possible tidyr solution

library(tidyr)
extract(df, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                   Streetname HouseNumber
# 1          Something Fakecity                  New Street            3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
# 3 SomethingDifferent Fakecity                 Fake Street          14a
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Thanks a lot! That worked for all my cases even the ones I did not identify before, like a missing space between the Street name and House Number. – Deset May 06 '15 at 11:24
5

You can try:

shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HousNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)

data

shops$street
#[1] "New Street 3"                    "Some-Complicated-Casestreet 1-3" "Fake Street 14a" 

results

shops$Streetname
#[1] "New Street"                  "Some-Complicated-Casestreet" "Fake` Street" 

shops$HousNumber
#[1] "3"   "1-3" "14a"
Cath
  • 23,906
  • 5
  • 52
  • 86
  • Thank you! That worked good. But I saw it did not work on one case which I myself didn`t identify before, where the formating was wrong "Fake Street48" (missing space). – Deset May 06 '15 at 11:22
  • @Deset, indeed, imy `regex` couldn't work if the space was missing – Cath May 06 '15 at 11:56
2

Create a pattern with back references that match both the street and the number and then using sub replace it by each backreference in turn. No packages are needed:

pat <- "(.*) (\\d.*)"
transform(shops,
   street = sub(pat, "\\1", street), 
   HouseNumber = sub(pat, "\\2", street)
)

giving:

                Name     city                      street  HouseNumber
1          Something Fakecity                  New Street            3
2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
3 SomethingDifferent Fakecity                 Fake Street          14a

Here is a visualization of pat:

(.*) (\d.*)

Regular expression visualization

Debuggex Demo

Note:

1) We used this for shops:

shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3", 
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name", 
"city", "street"), class = "data.frame", row.names = c(NA, -3L))

2) David Arenburg's pattern could alternately be used here. Just set pat to it. The pattern above has the advantage that it allows street names that have embedded numbers in them but David's has the advantage that the space may be missing before the street number.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0

You could use the package unglue

library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>                 Name     city                      street value
#> 1          Something Fakecity                  New Street     3
#> 2     SomethingOther Fakecity Some-Complicated-Casestreet   1-3
#> 3 SomethingDifferent Fakecity                 Fake Street   14a

Created on 2019-10-08 by the reprex package (v0.3.0)

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
0

Very complex issue for international addresses

$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a 
123/3 dsdsdfs
Roobertinkatu 36-40 
Flats 1-24 Acacia Avenue 
Apartment 9D, 1 Acacia Avenue 
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2 
Apartment 5005  no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 ΒΊ 
102 - 3 Esq 
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5 
11334 Nc Highway 72 E ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Output example

https://regex101.com/r/WVPBji/1

Daves
  • 1
  • 1