
How do I get the position of all the tokens after using unnest_tokens()? Here is a simple example -

df<-data.frame(id=1,
               doc=c("Patient:   [** Name **], [** Name **] Acct.#:         
[** Medical_Record_Number **]        MR #:     [** Medical_Record_Number **]
Location: [** Location **] "))

Tokenize by whitespace using tidytext -

library(dplyr)
library(tidytext)

tokens_df <- df %>% 
  unnest_tokens(tokens, doc, token = stringr::str_split, 
                pattern = "\\s",
                to_lower = F, drop = F)

How can I get the start and end position of each token? The desired output -

id  tokens    start  end
 1  Patient:      1    8
 1                9    9
 1  [**          12   14
 1  Name         16   19
x1carbon

2 Answers


I think the first answerer here has the right idea: if whitespace-split tokens and their character positions are the output you want, the best approach is plain string handling rather than tokenization and NLP.
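
For a single string, that plain string-handling idea might look something like this sketch (just an illustration of the idea, not the other answer's exact code): str_extract_all() pulls out the whitespace-delimited tokens and str_locate_all() returns their start/end character positions.

library(stringr)

txt <- "Patient:   [** Name **], [** Name **]"

tokens    <- str_extract_all(txt, "[^\\s]+")[[1]]    # the tokens themselves
positions <- str_locate_all(txt, "[^\\s]+")[[1]]     # matrix with start and end columns

cbind(token = tokens, as.data.frame(positions))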

If you also do want to use tidy data principles and end up with a data frame, try out something like this:

library(tidyverse)

df <- data_frame(id=1,
                 doc=c("Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **]    "))

df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)

#> # A tibble: 11 x 4
#>       id tokens                start   end
#>    <dbl> <chr>                 <int> <int>
#>  1  1.00 Patient:                  1     8
#>  2  1.00 [**                      12    14
#>  3  1.00 Name                     16    19
#>  4  1.00 **],                     21    24
#>  5  1.00 [**                      26    28
#>  6  1.00 Name                     30    33
#>  7  1.00 **]                      35    37
#>  8  1.00 Acct.#:                  39    45
#>  9  1.00 [**                      50    52
#> 10  1.00 Medical_Record_Number    54    74
#> 11  1.00 **]                      76    78

This will work for multiple documents, with the id column identifying each string, and the regex is constructed so that the actual whitespace is kept out of the output.
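
As a quick check of the multiple-document case (the second document here is just made up for illustration), the same pipeline carries the id along with each token's positions:

df2 <- data_frame(id = 1:2,
                  doc = c("Patient:   [** Name **]",
                          "Location:  [** Location **]"))

df2 %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)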

EDITED: In a comment, the original poster asked for an approach that would allow tokenizing by sentence and also keeping track of the positions of each word. The following code does that, in the sense that we get the start and end position for each token within each sentence. Could you use a combination of the sentenceID column with the start and end columns to find what you're looking for?

library(tidyverse)
library(tidytext)

james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)

d <- data_frame(txt = james)

d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)

#> # A tibble: 112 x 4
#>    sentenceID tokens   start   end
#>         <int> <chr>    <int> <int>
#>  1          1 the          1     3
#>  2          1 question     5    12
#>  3          1 thus        14    17
#>  4          1 becomes     19    25
#>  5          1 a           27    27
#>  6          1 verbal      29    34
#>  7          1 one         36    38
#>  8          1 again;      40    45
#>  9          1 and         47    49
#> 10          1 our         51    53
#> # ... with 102 more rows

Notice that these aren't quite "tokenized" in the normal sense from unnest_tokens(); closing punctuation such as commas and periods is still attached to each word. It seemed like that was what you wanted, based on your original question.
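
If you did want that closing punctuation dropped, one possible variation (just a sketch, output not shown here) is to restrict the regex to letters; because the same pattern feeds both str_extract_all() and str_locate_all(), the start/end columns still line up with the extracted tokens:

d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "[[:alpha:]']+"),    # letters and apostrophes only
         locations = str_locate_all(sentence, "[[:alpha:]']+"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)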

Julia Silge
  • Thank you. That makes sense. Huge fan of your package. Is there a way we could tokenize the original doc by sentence first and then tokenize it by words and extract its locations? If we break it by tokens, we lose track of sentence. If we break by sentence first, we lose track of token's locations. – x1carbon Jan 09 '18 at 03:26
  • I edited the answer here. Maybe it would help to know *why* you want to know the positions of the tokens? What are you going to do with these columns? – Julia Silge Jan 09 '18 at 03:48
  • This makes sense and I could figure this out from your blogs. But I would like to get the locations from the raw text and not the sentences. Why wrangle in this format - most bio-NLP annotations are done at the token level, and their offset (start position) and length features are calculated. To join the annotation info with the raw text, we need document id, sentence id, and token position. – x1carbon Jan 09 '18 at 04:24

Here is the non-tidy approach to the problem.

library(stringr)
library(dplyr)

regex = "([^\\s]+)"
df_i = str_extract_all(df$doc, regex)    # list: tokens per document
df_ii = str_locate_all(df$doc, regex)    # list: start/end matrix per document

output1 = Map(function(x, y, z){
  # guard against documents with no matches
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, token = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind, .) %>%
  merge(df, .)
x1carbon