I have a list of negative words with 4783 elements. I also have a data frame, tf2, with the columns "user", "reuser", "full_text", "range", "user.location", and "date2". I want to compare one column of the data frame against the negative-words list.

Based on the outcome, I want to add a TRUE/FALSE column to tf2 that says whether any word in negative is present in tf2$full_text.

I am trying something like this:

    tf3 <- apply(tf2, function(x) (x$negative <- intersect(x["full_text"], ng)))

But it does not work. Can we also use something like Python's any(ele in x.full_text.split() for ele in negative) in the function?

Here are 10 rows from the tf2 data frame (dput output):

    structure(list(user = c("jdugger2", "rustedshakles", "hhherm", 
    "KnightKiwi", "KeithGrayeb", "Clayconboy1", "goblinhunter44", 
    "migueli44271514", "hms_smeagol", "owlwoman911_"), reuser = c("TheOnion", 
    "TheOnion", "TheOnion", "TheOnion", "TheOnion", "GA_peach3102", 
    "TheOnion", "TheOnion", "TheOnion", "SSG_PAIN"), full_text = c("RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @GA_peach3102: A week-long REDUCTION in VIOLENCE between US, Taliban &amp; Afghan forces is set to begin Friday at midnight\n\nThis will lead…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @TheOnion: Taliban Agrees To Peace Deal Despite Concerns About America’s Human-Rights Record .....co/zMTRk7p8J8 .....co/N1KRAX…", 
    "RT @SSG_PAIN: ⚡⚡\nUS, Taliban Announce Peace Deal to Be Signed Next Week .....co/5sEqGQw8K5"
    ), range = c(140L, 140L, 140L, 140L, 140L, 143L, 140L, 140L, 
    140L, 95L), user.location = c("Queens, NY", "", "", "Ecruteak City, Johto", 
    "", "Arizona, USA", "Gobowen, England", "", "San Francisco", 
    "HighRockyNews RT for planet)"), date2 = c(21022020L, 21022020L, 
    21022020L, 21022020L, 21022020L, 21022020L, 21022020L, 21022020L, 
    21022020L, 21022020L)), row.names = c(NA, 10L), class = "data.frame")

I don't know how to share the full list of 4783 negative words here. If we use an arbitrary list of some 20 negative words instead, I guess we can test this.

ambrish dhaka
  • A new dataframe is expected, with a new column giving the boolean values: `TRUE` when any word in `full_text` matches any word in the list `negative`. – ambrish dhaka Mar 08 '20 at 04:28

1 Answer

Assuming you have a vector of words in negative, you can build a single pattern by pasting the words together with paste0 and test it with grepl.

    negative <- c('word1', 'word2')
    tf2$negative <- grepl(paste0('\\b', negative, '\\b', collapse = '|'), tf2$full_text)

Word boundaries (\\b) are added around each word in the pattern so that "is" doesn't match inside "this".
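As a small illustration of what the word boundaries change, here is a sketch on made-up inputs. The vectors `texts` and `words` are invented for this example and are not taken from the question's data.

```r
# Toy inputs, invented for illustration only.
texts <- c("this is fine", "a ban on trade", "Taliban agrees to deal")
words <- c("ban", "war")

# Without boundaries, "ban" also matches inside "Taliban".
loose <- grepl(paste0(words, collapse = "|"), texts)

# With \\b on both sides of each word, only whole words match.
strict <- grepl(paste0("\\b", words, "\\b", collapse = "|"), texts)

print(loose)   # FALSE TRUE TRUE
print(strict)  # FALSE TRUE FALSE
```

Note that grepl is case-sensitive by default, so a word such as "violence" would not match "VIOLENCE" unless you pass ignore.case = TRUE.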

Ronak Shah
  • thanks! It works. I was forced to bring my data into R from the jupyter-notebook for this reason. PS. https://stackoverflow.com/questions/60576936/find-any-word-of-a-list-in-the-column-of-dataframe – ambrish dhaka Mar 08 '20 at 04:48
  • I'm sure this can be done similarly in Python as well but I am not well versed with it. – Ronak Shah Mar 08 '20 at 05:00
  • Yes, for a very long list of negative words, like my 4783, there is a problem. However, when I used a 9-word negative list, `ng2 <- c("hatred", "violence", "kill", "war", "attack", "angry", "admonish", "ban", "sanction")`, it worked fine. My `tf` dataframe has nearly 200,000 rows. – ambrish dhaka Mar 08 '20 at 05:17
  • The number of rows is not the issue, but it might give you an error when you have a lot of words in `negative`. I think there is some limit on the length of regex you can have. In that case you can break `negative` into 2-3 parts, test each part separately, and combine the results. – Ronak Shah Mar 08 '20 at 05:18
  • @ambrishdhaka What I would do is find the number of words at which the regex stops giving an error. Say you take only the first 100 words: `result1 <- grepl(paste0('\\b', negative[1:100], '\\b', collapse = '|'), tf2$full_text)`, then `result2 <- grepl(paste0('\\b', negative[101:200], '\\b', collapse = '|'), tf2$full_text)`, and so on. Finally do `tf2$negative <- result1 | result2 | ....` You can write a loop/lapply/sapply to do the same. – Ronak Shah Mar 08 '20 at 06:57
  • Sorry @RonakShah, it took me a hell of a lot of effort to finally figure out that negative words such as `f**k, bull****, ***hole` simply broke the `regex` pattern. We can omit all our discussion. Your code is just perfect. But, as you can understand, vetting a list of 4783 elements was not an easy task. – ambrish dhaka Mar 08 '20 at 09:11
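
The two points raised in the comments (words containing regex metacharacters such as `*`, and testing the word list in parts) can be combined in one helper. The following is only a sketch, not part of the original answer: the names `escape_regex`, `match_any_word`, and `chunk_size` are hypothetical, and the chunk size of 100 is an arbitrary assumption borrowed from the comment above.

```r
# Hypothetical helper: put a backslash in front of regex metacharacters
# so that each word is matched literally.
escape_regex <- function(x) {
  gsub("([\\\\^$.|?*+()\\[\\]{}])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: TRUE for each element of `text` that contains at
# least one of `words` as a whole word. The word list is tested in chunks
# and the per-chunk results are combined with a logical OR.
match_any_word <- function(text, words, chunk_size = 100L, ignore_case = FALSE) {
  words <- escape_regex(unique(words))
  if (length(words) == 0L) return(rep(FALSE, length(text)))
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  hits <- lapply(chunks, function(w) {
    pattern <- paste0("\\b", w, "\\b", collapse = "|")
    grepl(pattern, text, ignore.case = ignore_case)
  })
  Reduce(`|`, hits)
}

# Toy inputs, invented for illustration only.
example_texts <- c("what the f**k", "Taliban agrees", "no war here", "fork")
example_words <- c("f**k", "ban", "war")
flags <- match_any_word(example_texts, example_words, chunk_size = 2L)
print(flags)  # TRUE FALSE TRUE FALSE
```

On the question's data this would be called as `tf2$negative <- match_any_word(tf2$full_text, negative)`. One limitation of this sketch: because `\\b` needs a word character on one side, a list entry that starts or ends with a symbol (for example `***hole`) only matches when that symbol is directly next to a letter or digit in the text.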