
I have a text as below.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)

When I tokenize it with the code below,

library(dplyr)
library(stringr)
library(purrr)
library(tidyr)

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)

It gives me the tokens and their start and end positions. How do I also obtain the POS tags while unnesting at the same time? Something like the table below (the POS tags in the image may not be correct).

[image: desired output, a table of tokens with their start/end positions and a POS tag column]

  • Just use the udpipe R package: example vignette at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html –  Aug 16 '18 at 07:12

3 Answers


You can use the package udpipe to get your POS data. Udpipe automatically tokenizes punctuation.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringsAsFactors = FALSE)

library(udpipe)
library(dplyr)
udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)


x <- udpipe_annotate(udmodel, 
                     df$Section)
x <- as.data.frame(x)

x %>% select(token, upos)
       token  upos
1         If SCONJ
2         an   DET
3   infusion  NOUN
4   reaction  NOUN
5     occurs  NOUN
6          , PUNCT
7  interrupt  VERB
8        the   DET
9   infusion  NOUN
10         . PUNCT

Now, to combine this with the result of a previous question you asked, I took one of the answers.

library(stringr)
library(purrr)
library(tidyr)

df %>%
  mutate(
    tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
    locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
    locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations) %>%
  mutate(upos = map_chr(tokens, function(x)
    as.data.frame(udpipe_annotate(udmodel, x = x, tokenizer = "vertical"))$upos))

       tokens start end  upos
1         If     1   2 SCONJ
2         an     4   5   DET
3   infusion     7  14  NOUN
4   reaction    16  23  NOUN
5     occurs    25  30  NOUN
6          ,    31  31 PUNCT
7  interrupt    33  41  VERB
8        the    43  45   DET
9   infusion    47  54  NOUN
10         .    55  55 PUNCT

edit: better solution

But the best solution is to start from udpipe and then do the rest. Note that I am using the stringi package instead of stringr; stringr is built on top of stringi, but stringi has more options.
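For example, stringi lets you ask for the first or for all occurrences of a fixed pattern directly (a small sketch, reusing the df from above):

stringi::stri_locate_first(df$Section, fixed = "infusion")  # first occurrence: 7 14
stringi::stri_locate_all(df$Section, fixed = "infusion")    # both occurrences: 7-14 and 47-54

Locating a token by its text alone only finds the first occurrence, which is why the second "infusion" in the output further down shows 7 and 14 again.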

x <- udpipe_annotate(udmodel, x = df$Section)

x %>% 
  as_data_frame %>% 
  select(token, POSTag = upos) %>% # select needed columns
  # add start/end locations
  mutate(locations = map(token, function(x) data.frame(stringi::stri_locate(df$Section, fixed = x)))) %>% 
  unnest

  # A tibble: 10 x 4
   token     POSTag start   end
   <chr>     <chr>  <int> <int>
 1 If        SCONJ      1     2
 2 an        DET        4     5
 3 infusion  NOUN       7    14
 4 reaction  NOUN      16    23
 5 occurs    NOUN      25    30
 6 ,         PUNCT     31    31
 7 interrupt VERB      33    41
 8 the       DET       43    45
 9 infusion  NOUN       7    14
10 .         PUNCT     55    55
– phiver
  • FYI. You can also use the udpipe R package to get part-of-speech tags for already tokenised text. An example is shown at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html#annotate_your_text where it says 'My text data is already tokenised'. For this use the argument tokenizer = 'vertical' (see the sketch after these comments). That avoids doing crazy things with a join, as you get the same number of rows back, in the same order, as the number of tokens you provide. – Aug 16 '18 at 07:09
  • @jwijffels, I had forgotten about this part of udpipe. I changed the answer a bit to show it, and I added a version that starts from udpipe, as that reduces the code a lot. – phiver Aug 16 '18 at 10:02
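A minimal sketch of that suggestion, assuming the udmodel loaded above and the AA data frame built in the question (the tokens column name is taken from there):

# feed the already-extracted tokens one per line and annotate them in a single call
x <- udpipe_annotate(udmodel,
                     x = paste(AA$tokens, collapse = "\n"),
                     tokenizer = "vertical")
# one row comes back per supplied token, in the same order, so the columns line up
AA$upos <- as.data.frame(x)$upos
# note: the question's regex keeps punctuation attached to words ("occurs,"),
# so the tags for those tokens may look off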

FYI. Since udpipe version 0.7 on CRAN, you can just do as follows.

library(udpipe)
x <- data.frame(doc_id = c("doc1", "doc2"),
                text = c("If an infusion reaction occurs, interrupt the infusion.",
                         "Houston we have a problem"),
                stringsAsFactors = FALSE)
x <- udpipe(x, "english")
x

which gives you (notice the start/end as well as the token/upos/xpos which you are looking for):

 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos xpos                                      feats head_token_id   dep_rel deps            misc
   doc1            1           1     1   2       1        1        If        if SCONJ   IN                                       <NA>             7      mark <NA>            <NA>
   doc1            1           1     4   5       2        2        an         a   DET   DT                  Definite=Ind|PronType=Art             5       det <NA>            <NA>
   doc1            1           1     7  14       3        3  infusion  infusion  NOUN   NN                                Number=Sing             4  compound <NA>            <NA>
   doc1            1           1    16  23       4        4  reaction  reaction  NOUN   NN                                Number=Sing             5  compound <NA>            <NA>
   doc1            1           1    25  30       5        5    occurs     occur  NOUN  NNS                                Number=Plur             7     nsubj <NA>   SpaceAfter=No
   doc1            1           1    31  31       6        6         ,         , PUNCT    ,                                       <NA>             7     punct <NA>            <NA>
   doc1            1           1    33  41       7        7 interrupt interrupt  VERB   VB                      Mood=Imp|VerbForm=Fin             0      root <NA>            <NA>
   doc1            1           1    43  45       8        8       the       the   DET   DT                  Definite=Def|PronType=Art             9       det <NA>            <NA>
   doc1            1           1    47  54       9        9  infusion  infusion  NOUN   NN                                Number=Sing             7       obj <NA>   SpaceAfter=No
   doc1            1           1    55  55      10       10         .         . PUNCT    .                                       <NA>             7     punct <NA> SpacesAfter=\\n
   doc2            1           1     1   7       1        1   Houston   Houston PROPN  NNP                                Number=Sing             0      root <NA>            <NA>
   doc2            1           1     9  10       2        2        we        we  PRON  PRP Case=Nom|Number=Plur|Person=1|PronType=Prs             3     nsubj <NA>            <NA>
   doc2            1           1    12  15       3        3      have      have  VERB  VBP           Mood=Ind|Tense=Pres|VerbForm=Fin             1 parataxis <NA>            <NA>
   doc2            1           1    17  17       4        4         a         a   DET   DT                  Definite=Ind|PronType=Art             5       det <NA>            <NA>
   doc2            1           1    19  25       5        5   problem   problem  NOUN   NN                                Number=Sing             3       obj <NA> SpacesAfter=\\n
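To get back to just the columns asked for in the question, a short dplyr follow-up on the same x works (a sketch that only uses columns shown in the output above):

library(dplyr)
x %>%
  filter(doc_id == "doc1") %>%    # the sentence from the question
  select(token, start, end, upos)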

Like the previous answerer, I think that udpipe is likely the easiest way to go for POS tagging. My favorite way to interact with udpipe is via the cleanNLP package. After the initializing function is called, it is just two lines of code to get the udpipe output.

library(tidyverse)
library(cleanNLP)

cnlp_init_udpipe()
#> Loading required namespace: udpipe

df <- data_frame(id = 1,
                 text = c("If an infusion reaction occurs, interrupt the infusion."))

cnlp_annotate(df) %>%
  cnlp_get_tif()
#> # A tibble: 10 x 19
#>    id      sid   tid word  lemma upos  pos     cid   pid definite mood 
#>    <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl> <int> <chr>    <chr>
#>  1 1         1     1 If    if    SCONJ IN        0     1 <NA>     <NA> 
#>  2 1         1     2 an    a     DET   DT        3     1 Ind      <NA> 
#>  3 1         1     3 infu… infu… NOUN  NN        6     1 <NA>     <NA> 
#>  4 1         1     4 reac… reac… NOUN  NN       15     1 <NA>     <NA> 
#>  5 1         1     5 occu… occur NOUN  NNS      24     1 <NA>     <NA> 
#>  6 1         1     6 ,     ,     PUNCT ,        30     1 <NA>     <NA> 
#>  7 1         1     7 inte… inte… VERB  VB       32     1 <NA>     Imp  
#>  8 1         1     8 the   the   DET   DT       42     1 Def      <NA> 
#>  9 1         1     9 infu… infu… NOUN  NN       46     1 <NA>     <NA> 
#> 10 1         1    10 .     .     PUNCT .        54     1 <NA>     <NA> 
#> # ... with 8 more variables: number <chr>, pron_type <chr>,
#> #   verb_form <chr>, source <int>, relation <chr>, word_source <chr>,
#> #   lemma_source <chr>, spaces <dbl>

Created on 2018-08-15 by the reprex package (v0.2.0).
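The tif output above already has what the question asks for: word is the token, upos/pos are the POS tags, and cid appears to be the 0-based character offset where each token starts (one less than the 1-based start positions in the question). A sketch that keeps only those columns, reusing the objects from the reprex:

cnlp_annotate(df) %>%
  cnlp_get_tif() %>%
  select(word, cid, upos, pos)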

– Julia Silge