I think the first answerer here has the right idea: string handling, rather than tokenization and NLP, is the best approach if whitespace-delimited tokens and their character positions are the output you want.
If you also want to use tidy data principles and end up with a data frame, try something like this:
library(tidyverse)
# note the runs of extra whitespace inside the string; the token positions below reflect them
df <- data_frame(id = 1,
                 doc = c("Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "))
df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)
#> # A tibble: 11 x 4
#>       id tokens                start   end
#>    <dbl> <chr>                 <int> <int>
#>  1  1.00 Patient:                  1     8
#>  2  1.00 [**                      12    14
#>  3  1.00 Name                     16    19
#>  4  1.00 **],                     21    24
#>  5  1.00 [**                      26    28
#>  6  1.00 Name                     30    33
#>  7  1.00 **]                      35    37
#>  8  1.00 Acct.#:                  39    45
#>  9  1.00 [**                      50    52
#> 10  1.00 Medical_Record_Number    54    74
#> 11  1.00 **]                      76    78
This will work for multiple documents, with an id column identifying each string, and it keeps actual whitespace out of the output because the regex ([^\s]+) matches only runs of non-whitespace characters.
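For example, here is a minimal sketch of the same pipeline run on two documents at once; the second doc string is invented purely for illustration:

library(tidyverse)

df2 <- data_frame(id = 1:2,
                  doc = c("Patient: [** Name **]",
                          "Dx: [** Diagnosis **]"))

# the same extract/locate/unnest steps as above, applied to each doc
df2 %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)

Each row of the result keeps the id of the document its token came from, so positions never get mixed up across documents.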
EDITED:
In a comment, the original poster asked for an approach that would tokenize by sentence while also keeping track of the position of each word. The following code does that, in the sense that we get the start and end position of each token within its sentence. Could you use a combination of the sentenceID column with the start and end columns to find what you're looking for?
library(tidyverse)
library(tidytext)
james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)
d <- data_frame(txt = james)
d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)
#> # A tibble: 112 x 4
#>    sentenceID tokens    start   end
#>         <int> <chr>     <int> <int>
#>  1          1 the           1     3
#>  2          1 question      5    12
#>  3          1 thus         14    17
#>  4          1 becomes      19    25
#>  5          1 a            27    27
#>  6          1 verbal       29    34
#>  7          1 one          36    38
#>  8          1 again;       40    45
#>  9          1 and          47    49
#> 10          1 our          51    53
#> # ... with 102 more rows
Notice that these aren't quite "tokenized" in the normal sense of unnest_tokens(); each word still has its adjoining punctuation attached, such as commas and periods. It seemed like you wanted that, based on your original question.
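If it helps, here is a minimal sketch (reusing the d data frame from above) of how the sentenceID, start, and end columns can be combined to slice a token back out of its sentence with str_sub(); the only change from the pipeline above is that the sentence column is kept rather than dropped:

library(tidyverse)
library(tidytext)

tokens_df <- d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  unnest(tokens, locations)

# slice each token's span back out of its own sentence;
# "recovered" should match "tokens" exactly
tokens_df %>%
  filter(tokens == "again;") %>%
  mutate(recovered = str_sub(sentence, start, end)) %>%
  select(sentenceID, tokens, start, end, recovered)

Because the positions were computed within each sentence, str_sub() on the sentence column recovers exactly the original token, punctuation and all.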