0

I have an old client database (.csv) of addresses. The biggest problem is they are not consistent and when I separate it, the Municipality is either in the Area, or City or so on... Example:

(header) Country, Municipality, City, Detailed address(street name, number, floor, ap.)

**(proper) Count.xxxxxx, Mun.xxxxx, City.xxxx**

(case 1) Count.xxxxxx, City.xxxx, Mun.xxxxx

(case 2) Count.xxxxxx, City.xxxx, -Mun.xxxxx

(case 3) City.xxxx, Count.xxxxxx, Mun.xxxxx

(case 4) Mun.xxxxx, City.xxxx, Count.xxxxxx

(case 5) Mun.xxxxx, Count.xxxxxx, City.xxxx 

"xxxx" = various names, contains also numbers, spaces and ".

I tried to reorder them all on the following format: Count.,Mun.,City. but everything I saw and tried was more like sorting and filtering

I need help reordering, so that the database is consistent and all the data is in the appropriate column.


more sophisticated examples:

Country,Area,Municipality,City,Detailed address street/boulevard number entrance floor ap. number (Detailed address is like Boul. Bulgaria 100 entr.A fl.4 ap.256)

As you can imagine not all fields are filled, and sometimes fields are not separated with "," (but this is a problem I will have to live with... cannot go throough 65k rows...)

Count.xxxxx, Area.xx xxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx

Area.xxxxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx Munic.xxxxx, Cit.xxxxx,
Addr.xx xxx, Count.xxxxx Count.xxxxx, Munic.xxxxx, Cit.xxxxx, Addr.xxxxx
Munic.xxxxx, Vill.xxxxx Area.xxxxx, Addr.xxxxx Munic.xxxxx, Cit.xxxxx
Cit.xxxxx, Munic.xx xxx, Addr.xxx xx

and the other thing is that it could be either City or Village (ct. vill.)

Mihail
  • 15
  • 5
  • 1
    Could you post a fictional example of some different rows of the database? this would give us something to work with :) – Bas Oct 12 '15 at 17:11

1 Answers1

2

It sounds like you just need to grab the county, city, and municipality from each row. You could do this by using grep to grab the proper row elements:

data.frame(County = apply(dat, 1, grep, pattern="Count\\.", value=TRUE),
           City = apply(dat, 1, grep, pattern="City\\.", value=TRUE),
           Mun = apply(dat, 1, grep, pattern="Mun\\.", value=TRUE))
#     County   City   Mun
# 1  Count.1 City.1 Mun.4
# 2  Count.3 City.2 Mun.7
# 3  Count.2 City.5 Mun.8
# 4  Count.2 City.2 Mun.1
# 5 Count.10 City.2 Mun.6
# 6  Count.1 City.1 Mun.4

Data:

(dat = data.frame(A=c("Count.1", "Count.3", "City.5", "City.2", "Mun.6", "Mun.4"),
                  B=c("City.1", "Mun.7", "Count.2", "Mun.1", "Count.10", "City.1"),
                  C=c("Mun.4", "City.2", "Mun.8", "Count.2", "City.2", "Count.1"),
                  stringsAsFactors=FALSE))
#         A        B       C
# 1 Count.1   City.1   Mun.4
# 2 Count.3    Mun.7  City.2
# 3  City.5  Count.2   Mun.8
# 4  City.2    Mun.1 Count.2
# 5   Mun.6 Count.10  City.2
# 6   Mun.4   City.1 Count.1
josliber
  • 43,891
  • 12
  • 98
  • 133
  • josilber, when i ran I got this: Error in data.frame(character(0), character(0), character(0), character(0), : arguments imply differing number of rows: 0, 1 – Mihail Oct 13 '15 at 10:38
  • Error fixed. Had to add the package "data.table". Instead of "(dat = data.frame(..." I replaced it with "(dat = data.table(..." and now it works like a russian missle - spot on!!! :) Thank you – Mihail Oct 13 '15 at 10:39