
My data set is in Korean, not English, and has more than 3000 observations.

The data set is named demo.

str(demo)

Each row holds the information for one person.

$ 거주지역: Factor w/ 900 levels "","강원 강릉시 포남1동",..: 595 235 595 832 12 126 600 321 600 589 ...

Above is the structure of the 4th column of the data set (거주지역, "area of residence").

I want to group people by the 4th column, which holds each person's address. The problem is that the factor has 900 levels, because the addresses are written out in full.

I want to assign people to groups by province, so R needs to read the factor values and pick out the relevant part of each address.

How can I do this? I searched for a long time but could not find an answer.

Doo Hyun Shin
  • I understand that you want to make `groups`. But, without further info along with reproducible data (in English), it is difficult to help. – akrun Dec 29 '14 at 14:02
  • Hi, so is the problem that you need to subset the strings giving the full address to pull out the provinces? Are the provinces always in the same place in the address? For a start, you might look at `strsplit` or `gsub`. – andybega Dec 29 '14 at 14:04
  • @andybega Indeed, it is always in the same place in the address. – Doo Hyun Shin Dec 29 '14 at 14:06
  • @akrun I am not sure you can read the reproducible data because it is not in English. But I can try. – Doo Hyun Shin Dec 29 '14 at 14:07
  • @DooHyunShin No, I won't be able to :-) – akrun Dec 29 '14 at 14:08
  • Can you give a few example addresses and where the province is identified in them? Maybe just bold the word containing the province? – andybega Dec 29 '14 at 14:16

1 Answer


Here is a possible start, though I'm not sure how it will work with non-Latin characters.

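# Toy data: the province is the last word of each address
foo <- data.frame(value=rnorm(3), 
                  address=c("blah blah province1", "blah blah province2", "province3"),
                  stringsAsFactors=FALSE)

# Split each address into words, then bind the pieces into a matrix.
# Note: rbind silently recycles the one-word address to fill all three columns.
words <- strsplit(foo$address, " ")
words <- do.call(rbind, words)
foo$province <- words[, 3]

head(foo)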

Output:

       value             address  province
1 0.01129269 blah blah province1 province1
2 0.99160104 blah blah province2 province2
3 1.59396745           province3 province3

Guessing by [this wiki page](http://en.wikipedia.org/wiki/Addresses_in_South_Korea) on South Korean address formats, if the city and province (ward?) always come at the beginning of the address, then it's a bit easier and we can avoid using rbind, which in the code above recycles shorter addresses.

foo <- data.frame(value=rnorm(3), 
                  address=c("seoul ward1 street", "seoul ward2 street", "not-seoul ward-something street"),
                  stringsAsFactors=FALSE)

# Take the first word of each address as the city and the second as the ward
foo$city <- sapply(foo$address, function(x) strsplit(x, split=" ")[[1]][1])
foo$ward <- sapply(foo$address, function(x) strsplit(x, split=" ")[[1]][2])

Now we can use ifelse to group by ward within Seoul and by city otherwise.

foo$group <- with(foo, ifelse(city=="seoul", ward, city))
foo

       value                         address      city           ward     group
1  1.0071995              seoul ward1 street     seoul          ward1     ward1
2  0.7192918              seoul ward2 street     seoul          ward2     ward2
3 -0.6047117 not-seoul ward-something street not-seoul ward-something not-seoul
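
For data shaped like the original question (a factor column, addresses of different lengths, and some empty values), the same idea can be sketched as follows. This is only an illustration under stated assumptions: the data frame `toy`, its `address` column, and the helper `nth_word` are hypothetical names, the two Korean addresses are the ones quoted in the comments, and the empty string stands in for a missing address.

```r
# Hypothetical sample: two addresses quoted in the comments plus one empty value,
# stored as a factor like the 거주지역 column in the question.
toy <- data.frame(
  address = factor(c("서울 송파구 가락본동", "경기 고양시 일산동구 마두1동", ""))
)

# Convert the factor to character first, then split on single spaces.
parts <- strsplit(as.character(toy$address), " ", fixed = TRUE)

# Hypothetical helper: pick the i-th word of each address.
# Positions that do not exist (e.g. in an empty address) come back as NA.
nth_word <- function(p, i) vapply(p, function(w) w[i], character(1))

toy$first  <- nth_word(parts, 1)
toy$second <- nth_word(parts, 2)

# Use the second word inside 서울 (Seoul) and the first word elsewhere.
toy$group <- with(toy, ifelse(first == "서울", second, first))
toy$group
# "송파구" "경기" NA
```

Because each address is handled on its own, the result always has one value per row, and empty addresses show up as NA instead of being dropped or recycled.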
andybega
  • Thank you very much, but one more thing... One city has a population of 10 million, while the other cities have small populations. So within the big city I need to group by province, which is exactly what your answer does. For the small cities, however, I need to group by city name... Do you think that is possible with your answer above? Thank you very much again. – Doo Hyun Shin Dec 29 '14 at 14:19
  • I think so. If the cities are also always in the same word position, you can use similar code to create a variable for cities, then use the two to build a third variable that you set with maybe something like `ifelse(city=="Seoul", province, city)`. – andybega Dec 29 '14 at 14:27
  • Even though I used "stringsAsFactors=FALSE" the data frame has addresses as factor. – Doo Hyun Shin Dec 29 '14 at 15:25
  • Well, I solved the factor problem. But I still have one more problem. After the coding you taught me "words <- strsplit(foo$address, " ")" words has 5000 elements, some of which have structure of $ : chr [1:3] "서울" "송파구" "가락본동", while some other have structure of $ : chr [1:4] "경기" "고양시" "일산동구" "마두1동". Their length is a little bit different. After the coding you taught me "words <- do.call(rbind, words)" words has 4985 obs. It needs to have 5000 obs. How can I solve this problem? – Doo Hyun Shin Dec 29 '14 at 15:45
  • Where are city and province in the 3 and 4 word addresses? Based on [this](http://en.wikipedia.org/wiki/Addresses_in_South_Korea) I'm guessing the city is first and ward (province) is second. In that case you can avoid using `rbind`, see the edit I've made. – andybega Dec 30 '14 at 14:11
  • I found out that there are 15 obs which have missing values in addresses. Thank you very much – Doo Hyun Shin Dec 31 '14 at 13:27
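
The drop from 5000 to 4985 rows described in the comments is consistent with how base R treats empty addresses: `strsplit` returns a zero-length vector for an empty string, and `rbind` ignores zero-length vectors. A minimal sketch, assuming the missing addresses are stored as empty strings (the `""` level in the `str` output suggests this); the vector `addr` below is a hypothetical sample:

```r
# Hypothetical addresses; the empty string stands in for a missing address.
addr <- c("서울 송파구 가락본동", "", "경기 고양시 일산동구 마두1동")
words <- strsplit(addr, " ", fixed = TRUE)

lengths(words)
# 3 0 4: the empty address splits into zero words

# rbind ignores the zero-length element, so one row goes missing.
# (It also warns about the unequal lengths, which is suppressed here.)
nrow(suppressWarnings(do.call(rbind, words)))
# 2

# One possible fix: pad every element with NA to a common length before binding.
n <- max(lengths(words))
padded <- lapply(words, `length<-`, n)
mat <- do.call(rbind, padded)
nrow(mat)
# 3
```

With the padding step, every input address keeps its own row, and missing words appear as NA rather than as recycled values.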