How to split a text into a vector, where each entry corresponds to an index value assigned to each unique word?

Question

Let's say I have a document with some text, like this, from SO:

doc <- 'Questions with similar titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'

I can then make a dataframe where every word has its own row:

library(stringi)
dfall <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc))))

I'd like to add a second column with each word's unique id. To get the ids, I first remove duplicates:

library(dplyr)
uniquedf <- distinct(data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc)))))

I'm struggling with how to match the rows of the two dataframes, so that the row index from uniquedf becomes a new column value in dfall:

alldf <- dfall %>% mutate(id = which(uniquedf$words == words))

A dplyr method like this doesn't work.

Is there a more efficient way to do this?

To give an even simpler example to show the expected output, I'd like a dataframe that looks like this:

  words id
1    to  1
2   row  2
3   zip  3
4   zip  3

Where my starting word vector is: doc <- c('to', 'row', 'zip', 'zip') or doc <- c('to row zip zip'). The id column adds a unique id for each unique word.

Maybe add your expected output here if possible. I'm not even sure your question is a dupe, but if you show what you are trying to do, someone can edit your title. – Tim Biegeleisen Feb 07 '19 at 14:47
Simply `dfall$id <- match(dfall$words, unique(dfall$words))` and avoid the second step altogether. Or `dfall$id <- as.numeric(factor(dfall$words))` if the index order doesn't matter. – David Arenburg Feb 07 '19 at 15:05
@DavidArenburg That is perfect, and much faster than sapply. I'll accept if you post. – Union find Feb 07 '19 at 15:08
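
The two one-liners from the comment above can be compared on the toy vector from the question. This is only a minimal sketch: the column name `id_sorted` is made up here for illustration. It suggests that `match()` numbers words in order of first appearance, while `factor()` numbers them by sorted level.

```r
# Toy input from the question
dfall <- data.frame(words = c("to", "row", "zip", "zip"),
                    stringsAsFactors = FALSE)

# ids in order of first appearance: to = 1, row = 2, zip = 3
dfall$id <- match(dfall$words, unique(dfall$words))

# ids follow the sorted factor levels instead: row = 1, to = 2, zip = 3
# (id_sorted is a hypothetical column name, not from the original post)
dfall$id_sorted <- as.integer(factor(dfall$words))

print(dfall)
```

If the ids have to follow the order in which words first occur, as in the expected output in the question, the `match()` form appears to be the one to use.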

score 2 · Accepted Answer · edited Feb 07 '19 at 14:58


A cheap way using sapply:

data

doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'

function

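# dfall and uniquedf are rebuilt from the doc above, as in the question
# for each row of dfall, look up the row index of its word in uniquedf
alldf <- cbind(dfall, sapply(seq_len(nrow(dfall)), function(x) which(uniquedf$words == dfall$words[x])))
colnames(alldf) <- c("words", "id")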
> alldf
        words id
1   questions  1
2        with  2
3        with  2
4      titles  3
5        have  4
6  frequently  5
7        been  6
8   downvoted  7
9         and  8
10         or  9
11     closed 10
12   consider 11
13      using 12
14          a 13
15      title 14
16       that 15
17       more 16
18 accurately 17
19  describes 18
20       your 19
21   question 20
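
As a rough check, the per-row `which()` lookup above can be compared with a vectorised `match()` on the same text. This sketch makes two assumptions: it uses a base-R `strsplit()` tokenizer as a stand-in for `stri_extract_all_words()`, and the names `uniq`, `ids_sapply` and `ids_match` are illustrative only.

```r
doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'

# Assumption: split on runs of non-letters instead of using stringi
words <- strsplit(tolower(doc), "[^a-z]+")[[1]]
uniq <- unique(words)  # unique words in order of first appearance

# Row-by-row lookup, in the spirit of the sapply approach
ids_sapply <- sapply(seq_along(words), function(i) which(uniq == words[i]))

# Vectorised lookup in a single call
ids_match <- match(words, uniq)

# Both approaches should give the same ids
print(identical(ids_sapply, ids_match))
```

Because `match()` does the whole lookup in one call, it avoids running one `which()` scan per row, which is the likely reason it scales better on longer texts.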

edited Feb 07 '19 at 14:58

Union find


answered Feb 07 '19 at 14:51

boski


Looks like "with" got duplicated there? Did you do that for demonstration? – Union find Feb 07 '19 at 14:53
Yes, I did it on purpose, as your example did not have any duplication. – boski Feb 07 '19 at 14:54
I am a dope. Thank you. – Union find Feb 07 '19 at 14:55
