extracting a word (of variable length) ending with 동 from a string in R

Question

I have a data frame in R with one column containing an address in Korean. I need to extract one of the words (a word ending with 동), if it's there (it may be missing), and create a new column named "dong" that contains this word. My data is in the column "address" and the desired output is in the column "dong", both shown below.

address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
dong <- c("탄방동","효동","오정동","자양동",NA)
data <- data.frame(address,dong, stringsAsFactors = FALSE)

I've tried using grep but it's not giving me exactly what I need.

grep(".+동\\s",data$address,value=T)

I think I have 2 issues: 1) I'm not sure how to write a proper regular expression to identify the word I need and 2) I'm not sure why grep returns the whole string rather than the word. I would appreciate any suggestions.

You do not need an additional library, see [this demo](http://ideone.com/Wff1NS). Also, match whole words with a word boundary, not with whitespace. — Wiktor Stribiżew, Apr 02 '17 at 09:55

score 1 · Answer 1 · answered Apr 02 '17 at 09:44

grep with value=TRUE returns every element that contains a match, in full; it does not extract the matched part. In your case, the stringr library is useful.

library(stringr)
str_match(paste0(data$address, ' '), '([^\\s]+동)\\s')
     [,1]      [,2]    
[1,] "탄방동 " "탄방동"
[2,] "효동 "   "효동"  
[3,] "오정동 " "오정동"
[4,] "자양동 " "자양동"
[5,] NA        NA

Column 2 is what you want. Note that I added a space at the end of each string so that the regex still matches when the "dong" word is the last word in the string.
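
To illustrate the difference between filtering and extracting in base R, here is a minimal sketch on two of the addresses from the question. The lookahead pattern is an assumption standing in for the padded-space trick above (it is not part of the original answer), and perl = TRUE is used so that the lookahead is supported.

```r
address <- c("대전광역시 서구 탄방동 홈플러스", "대전광역시 유성구 용계로 128")

# grep(value = TRUE) filters the vector: it returns each element that
# contains a match, in full, not the matched part.
filtered <- grep("동", address, value = TRUE)

# regexpr() locates the first match in each element and regmatches()
# extracts just that substring. The lookahead (?=\s|$) requires whitespace
# or the end of the string after 동 without consuming it, so no padding
# space has to be appended. Elements without a match are dropped here.
m <- regexpr("\\S+동(?=\\s|$)", address, perl = TRUE)
hits <- regmatches(address, m)

print(filtered)
print(hits)
```

Because regmatches() drops the elements that have no match, the result can be shorter than the input, so it cannot be assigned to a data frame column without filling in NA first.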

Thank you so much, it works great! I appreciate your help. — carpediem, Apr 02 '17 at 09:48

score 1 · Accepted Answer · answered Apr 02 '17 at 10:00

A regex that matches whole Korean words ending with a specific character is

\b\w*동\b

See the regex demo.

Details:

\b - leading word boundary
\w* - 0+ word chars
동 - ending letter
\b - trailing word boundary
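
As a rough illustration of what the two word boundaries do, the sketch below wraps the pattern in a small helper. The name words_ending_with is hypothetical, and the Korean examples assume a UTF-8 locale so that \w and \b treat Hangul syllables as word characters.

```r
# Hypothetical helper: every whole word in each string that ends with `suffix`.
words_ending_with <- function(x, suffix) {
  pattern <- paste0("\\b\\w*", suffix, "\\b")
  regmatches(x, gregexpr(pattern, x))
}

# "성동경로당" contains 동 but does not end with it: 동 is followed by
# another word character there, so the trailing \b fails and the word is
# skipped. "자양동" ends with 동 and should be returned.
print(words_ending_with("자양동 87-3번지 성동경로당", "동"))

# The trailing \b also matches at the very end of the string, so no
# padding space is needed when the word comes last.
print(words_ending_with("대전광역시 서구 탄방동", "동"))
```

The same helper works with an ASCII suffix, for example words_ending_with("abcg 12 xgyz", "g"), which makes the boundary behaviour easy to check without depending on the locale.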

See the R demo:

address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
## matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address, perl=TRUE ))
matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address ))
dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x))
data <- data.frame(address,dong, stringsAsFactors = FALSE)

Output:

                                     address   dong
1            대전광역시 서구 탄방동 홈플러스 탄방동
2              대전광역시 동구 효동 주민센터   효동
3          대전광역시 대덕구 오정동 한남마트 오정동
4 대전광역시 동구 자양동 87-3번지 성동경로당 자양동
5               대전광역시 유성구 용계로 128   <NA>

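Note that the line dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x)) is necessary to add NA to those rows where no match was found. It assumes at most one match per address; with more than one, unlist() would return more values than there are rows.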
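
If some addresses could contain more than one word ending with 동, one possible variant keeps only the first match per row. This is a sketch under stated assumptions, not the original answer's code: the helper name first_word_ending_with is hypothetical, the third address is made up to show the two-match case, and a UTF-8 locale is assumed for the Korean text.

```r
# Hypothetical helper: first whole word ending with `suffix`, or NA if none.
first_word_ending_with <- function(x, suffix) {
  pattern <- paste0("\\b\\w*", suffix, "\\b")
  matches <- regmatches(x, gregexpr(pattern, x))
  # vapply() enforces exactly one character value per input string, so the
  # result always has the same length as x.
  vapply(matches,
         function(m) if (length(m) == 0) NA_character_ else m[1],
         character(1))
}

address <- c("대전광역시 서구 탄방동 홈플러스",
             "대전광역시 유성구 용계로 128",
             "대전광역시 서구 탄방동 둔산동")  # made-up row with two candidate words
data <- data.frame(address, stringsAsFactors = FALSE)
data$dong <- first_word_ending_with(data$address, "동")
print(data)
```

Taking m[1] is a design choice; pasting the matches together or keeping a list column would be equally reasonable if every match matters.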

Thank you very much! This also works great. I really appreciate your detailed explanation. — carpediem, Apr 02 '17 at 12:10
