You can use the package udpipe to get your POS data. Udpipe automatically tokenizes punctuation.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringAsFactors = FALSE)
library(udpipe)
library(dplyr)
udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
df$Section)
x <- as.data.frame(x)
x %>% select(token, upos)
token upos
1 If SCONJ
2 an DET
3 infusion NOUN
4 reaction NOUN
5 occurs NOUN
6 , PUNCT
7 interrupt VERB
8 the DET
9 infusion NOUN
10 . PUNCT
Now to combine this the result of a previous question you asked. I took one of the answers.
library(stringr)
library(purrr)
library(tidyr)
df %>% mutate(
tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
unnest(tokens, locations) %>%
mutate(POS = purrr::map_chr(tokens, function(x) as.data.frame(udpipe_annotate(udmodel, x = x, tokenizer = "vertical"))$upos))
tokens start end upos
1 If 1 2 SCONJ
2 an 4 5 DET
3 infusion 7 14 NOUN
4 reaction 16 23 NOUN
5 occurs 25 30 NOUN
6 , 31 31 PUNCT
7 interrupt 33 41 VERB
8 the 43 45 DET
9 infusion 47 54 NOUN
10 . 55 55 PUNCT
edit: better solution
But the best solution would be to start from udpipe and then do the rest. Note that I am using stringi instead of stringr package. stringr is based on stringi, but stringi has more options.
x <- udpipe_annotate(udmodel, x = df$Section)
x %>%
as_data_frame %>%
select(token, POSTag = upos) %>% # select needed columns
# add start/end locations
mutate(locations = map(token, function(x) data.frame(stringi::stri_locate(df$Section, fixed = x)))) %>%
unnest
# A tibble: 10 x 4
token POSTag start end
<chr> <chr> <int> <int>
1 If SCONJ 1 2
2 an DET 4 5
3 infusion NOUN 7 14
4 reaction NOUN 16 23
5 occurs NOUN 25 30
6 , PUNCT 31 31
7 interrupt VERB 33 41
8 the DET 43 45
9 infusion NOUN 7 14
10 . PUNCT 55 55