I have been trying to write a function, or use the apply family, to select the rows in a data frame that contain the words I'm looking for and mark them with a tag. A row can have several tags. Can someone please help me? I have been stuck for a while.

If my question is unclear, or if there is an answer somewhere else, please point me in the right direction. Much appreciated!

library(stringr)
library(dplyr)
# `sentences` is presumably the example character vector that ships with stringr
df <- data.frame(sentences, rnorm(length(sentences)))

old = df %>% filter(str_detect(sentences, 'old')) %>% mutate(w = factor("old"))
new = df %>% filter(str_detect(sentences, 'new')) %>% mutate(w = factor("new"))
boy = df %>% filter(str_detect(sentences, 'boy')) %>% mutate(w = factor("boy"))
girl = df %>% filter(str_detect(sentences, 'girl')) %>% mutate(w = factor("girl"))
tags <- bind_rows(old, new, boy, girl)

So I want to choose a finite number of words, for example:

tags <- c('bananas', 'apples', 'oranges')

And I want the result to be a data.frame with a new column for every word I have chosen. If a row contains one of the chosen words, the column for that word should be TRUE or marked somehow. Something like

Sentences     bananas     apples     oranges  
sentence1     TRUE        
sentence2                 TRUE
sentence3     TRUE
sentence4                            TRUE
sentence5                 TRUE       TRUE

or

Sentences     tag1        tag2
sentence1     bananas        
sentence2     apples
sentence3     bananas
sentence4     oranges
sentence5     apples      oranges

Or something like that. Please let me know if I can explain more clearly.

  • What is the final solution you are looking for? Conceptually, what would it be able to do? – A. Stam Dec 13 '17 at 13:00
  • Is there a finite, known amount of words you are trying to tag? – LAP Dec 13 '17 at 13:44
  • I tried to explain a bit more. Yes, the number of words is finite, and I want every row to be tagged if it contains any of the words. I don't know which layout is better: a column for every word, or tag #1, #2, #3 up to the maximum number of tags. – CluelessCoder Dec 14 '17 at 16:24
  • Can you provide `sentences`? – acylam Dec 14 '17 at 16:31

1 Answer

Do you really need the apply family? I'm fairly sure the tm package is what you're looking for: its DocumentTermMatrix function gives you one row per sentence and one column per word, which is the easiest and most robust way to get what you want. I have made up some sentences of my own (with a high syntactic level). The easiest approach is to build the matrix for all the words and then select the columns for the words you want to find.

sentence1 <- "This is a banana"
sentence2 <- "This is an apple"
sentence3 <- "This is a watermelon and a banana"
sentence4 <- "This is a watermelon a banana an apple"

df_sentence <- rbind(sentence1, sentence2, sentence3, sentence4)

library(tm)
vs_sentence <- VectorSource(df_sentence)
vc_sentence <- VCorpus(vs_sentence)

clean_sentence <- tm_map(vc_sentence, removePunctuation)
dtm_sentence <- DocumentTermMatrix(clean_sentence)
as.matrix(dtm_sentence)

The result:

        Terms
Docs and apple banana this watermelon
   1   0     0      1    1          0
   2   0     1      0    1          0
   3   1     0      1    1          1
   4   0     1      1    1          1
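
Selecting the columns for the chosen words, as described above, might look like the following minimal sketch. The matrix `m` is typed in by hand to mirror the result shown, so the snippet runs without tm; with the real data it would be `as.matrix(dtm_sentence)`. The `tags` vector and the names `flags` and `tagged` are illustrative assumptions, not part of the original answer.

```r
# Hand-typed stand-in for as.matrix(dtm_sentence), copied from the result above.
m <- matrix(
  c(0, 0, 1, 1, 0,
    0, 1, 0, 1, 0,
    1, 0, 1, 1, 1,
    0, 1, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(Docs = as.character(1:4),
                  Terms = c("and", "apple", "banana", "this", "watermelon"))
)

# Hypothetical words to tag; "orange" does not occur in any sentence.
tags <- c("banana", "apple", "orange")

# One logical column per tag: TRUE where the term count is above zero.
# Tags that are not a column of the matrix get an all-FALSE column
# instead of triggering a "subscript out of bounds" error.
flags <- vapply(
  tags,
  function(tag) {
    if (tag %in% colnames(m)) m[, tag] > 0 else rep(FALSE, nrow(m))
  },
  logical(nrow(m))
)

tagged <- data.frame(sentence = paste0("sentence", rownames(m)), flags)
print(tagged)
```

Note that the document-term matrix matches whole terms, so a tag such as "bananas" would need its own column (or stemming) to be found alongside "banana".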

There is also a function that gives you the transposed layout, with terms in rows and documents in columns:

as.matrix(TermDocumentMatrix(clean_sentence))
            Docs
Terms        1 2 3 4
  and        0 0 1 0
  apple      0 1 0 1
  banana     1 0 1 1
  this       1 1 1 1
  watermelon 0 0 1 1
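
For comparison, the apply-family route mentioned in the question can be sketched in base R without tm. This is only a rough sketch under assumptions: the example sentences are the ones used above, the `tags` vector is hypothetical, and matching is a plain case-sensitive substring search, so "banana" would also match "bananas".

```r
# Example sentences in the spirit of the ones used above.
sentences <- c("This is a banana",
               "This is an apple",
               "This is a watermelon and a banana",
               "This is a watermelon a banana an apple")

# Hypothetical words to tag.
tags <- c("banana", "apple", "watermelon")

# One logical column per tag: TRUE where the sentence contains the word.
# fixed = TRUE treats each tag as a literal string rather than a regex.
flags <- sapply(tags, function(tag) grepl(tag, sentences, fixed = TRUE))

tagged <- data.frame(sentence = sentences, flags)

# Alternative layout: all matching tags of a row collapsed into one column.
tagged$tags <- apply(flags, 1, function(row) paste(tags[row], collapse = ", "))

print(tagged)
```

The first layout corresponds to the TRUE/FALSE table sketched in the question; the `tags` column is a compact variant of the tag1/tag2 layout.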

If you could provide a part of your sentences maybe it would be easier to give you a better solution. HTH!

Tito Sanz