
Relevant files:

biggie

positive

I'm working on some natural language processing and am trying to check whether each word in one list appears in another, using the `%in%` operator. The problem is that it returns FALSE for everything, even though I know there should be at least a few TRUE results. I'm wondering if the issue is with the type of objects I am working with, although when I test them everything is a character object, so I thought this shouldn't be an issue. Here is my code:

library(dplyr)
library(tokenizers)
library(tidytext)

biggie <- read.csv("C:/Users/My.Name/Desktop/biggie.csv", stringsAsFactors=FALSE)

colnames(biggie)[1] <- 'biggie'



bigsplit <- biggie %>% 
  unnest_tokens(word, biggie)

pos <- read.csv("C:/Users/My.Name/Desktop/positive.csv", stringsAsFactors = FALSE)

positive <- function(data){
  pos_count <- 0
  for(i in 1:nrow(data)){
    if (data[i,1] %in% pos){
      pos_count = pos_count + 1
    }
  }
  return(pos_count/nrow(data))
}

Here I found a workaround, but I feel like it adds unnecessary loops/steps into the function and takes a lot more computing power than I would like it to:

#Tests
bigsplit[1,1] = "abound"
bigsplit[1,1] %in% pos #Returns FALSE, but I would expect TRUE
bigsplit[1,1] %in% pos[1,1] #Returns TRUE

#NEW FUNCTION
positive <- function(data){
  pos_count = 0
  for(i in 1:nrow(data)){
    match_this <- data[i,1]
    for(j in 1:nrow(pos)){
      if(match_this %in% pos[j,1]){
        pos_count <- pos_count + 1
      }
    }
  }
  return(pos_count/nrow(data))
}

If anyone has any tips on these issues, I would really appreciate hearing them. Thanks!

  • You will get help **much** faster if you create a small copy/pasteable example that doesn't require people downloading files and reading them in to see what's going on. Just use `dput()` on a few rows of your data that shows the problem. – Gregor Thomas Apr 24 '18 at 15:28
  • Your actual problem is that `pos` is a `dataframe` so this line; `if (data[i,1] %in% pos)`, will not work as you don't specify the column to match against. You need to match against a vector so change `pos` to `pos$col_name` for example – Relasta Apr 24 '18 at 15:41
  • Awesome, this is exactly it. Thank you! – lostpineapple45 Apr 24 '18 at 15:50
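
For later readers, here is a minimal sketch of the fix described in the comments. It uses small made-up data frames in place of the two CSV files, and the column name `word` in `pos` is an assumption (the real file's column name isn't shown in the question). Because a data frame is a list of columns, `%in%` compares the word against each whole column rather than against the individual words, which is why the original check came back FALSE. Matching against the column vector also makes both loops unnecessary, since `%in%` is vectorized.

```r
# Toy stand-ins for the two CSV files (hypothetical data, not the asker's files)
pos <- data.frame(word = c("abound", "good", "happy"),
                  stringsAsFactors = FALSE)
bigsplit <- data.frame(word = c("abound", "the", "happy", "street"),
                       stringsAsFactors = FALSE)

# A data frame is a list of columns, so %in% compares against whole columns
in_frame  <- "abound" %in% pos        # FALSE: no column equals "abound"
in_column <- "abound" %in% pos$word   # TRUE: matched against the word vector

# Vectorized replacement for the loop: share of words found in the lexicon.
# `lexicon` is a character vector, e.g. pos$word (assumed column name).
positive <- function(data, lexicon) {
  mean(data[[1]] %in% lexicon)
}

share <- positive(bigsplit, pos$word)  # 2 of the 4 toy words match
```

With the real data, the call would presumably look like `positive(bigsplit, pos[[1]])`, which avoids having to know the column name at all.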
