0

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address. As seen below the address is in one string.

Head(DF, 3)

Address                            Price        m2        Rooms   
Petersvej   1772900  Hoersholm     10.000       210       5        
Annasvej    2B2900   Hoersholm     15.000       230       4        
Krænsvej    125800   Lyngby C      10.000       210       5               
A Mivs Alle 119800   Hjoerring      1.300        70       3        

The syntax for the address coloumn is: road name, road no., followed by a 4 digit postalcode and the city name(sometimes two words).

Also need to extract the postalcode.. been looking at 'stringi' package haven't been able to find any examples..

any pointers are very much appreciated

Thomas
  • 17
  • 6

3 Answers3

3

1) Using separate in tidyr separate the subfields of Address into 3 fields merging anything left over into the last and then use separate again to split off the last 4 digits in the Number column that was generated in the first separate.

library(dplyr)
library(tidyr)

DF %>% 
  separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
  separate(Number, into = c("StreetNo", "Postal"), sep = -4)

giving:

       Road StreetNo Postal      City Price  m2 Rooms      CITY
1 Petersvej       77   2900 Hoersholm    10 210     5 Hoersholm
2  Annasvej     121B   2900 Hoersholm    15 230     4 Hoersholm
3  Krænsvej       12   5800  Lyngby C    10 210     5         C

2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.

DF %>% 
  mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
  separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")

Note

The input DF in reproducible form is:

DF <- 
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm", 
"Krænsvej 125800   Lyngby C", "Petersvej 772900  Hoersholm"), class = "factor"), 
    Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L, 
    4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C", 
    "Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA, 
-3L))

Update

Added and fixed (2).

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • thank you! This works in most cases. But as shown: [link](https://imgur.com/a/WZHuqhr) it isn't working for all instances. – Thomas Aug 27 '18 at 17:03
  • It covers all cases covered by the example in the question. Please fix the example in the question if you want to change it by showing the output of a revised `dput(DF) `suitably cut down to include all cases you want to cover. Do not use images and do not just show the output of `head`. – G. Grothendieck Aug 27 '18 at 17:33
1

Check out the cSplit function from the splitstackshape package

library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space

#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
SmitM
  • 1,366
  • 1
  • 8
  • 14
0

One way to do this is with regex.

In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.

    library(stringr)

    DF <- data.frame(Address=c("Petersvej 772900  Hoersholm",
      "Annasvej 121B2900 Hoersholm",
      "Krænsvej 125800   Lyngby C"))
    DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))

This will give you the following output:

                          Address      CITY
    1 Petersvej 772900  Hoersholm Hoersholm
    2 Annasvej 121B2900 Hoersholm Hoersholm
    3  Krænsvej 125800   Lyngby C  Lyngby C

In R the stringr package is preferred for regex because it allows for multiple-group capture, which in this example could allow you to separate each component of the address with one expression.

Weston
  • 1