R: find words from tweets in Lexicon, count them and save number in dataframe with tweets

Question

I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built my own lexicon (formal_lexicon) of around 1 million words, all of which are formal in style. I want to write a short piece of code that counts, per tweet, how many of its words (if any) are also in that lexicon.

tweets_data:

                   Content            
1                 "Blablabla"               
2                 "Hi my name is"               
3                 "Yes I need"                 
.  
.
. 
50176            "TEXT50176"

formal_lexicon:

                       X            
1                 "admittedly"               
2                 "Consequently"               
3                 "Furthermore"                 
.  
.
. 
1000000            "meanwhile"

The output should thus look like:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0 
.  
.
. 
50176            "TEXT50176"             2

I think it should be a simple for loop like:

for(sentence in tweets_data$Content){ 
  for(word in sentence){
    if(word %in% formal_lexicon){
         ...
}
}
}

I don't think "word" works and I'm not sure how to count in the specific column if a word is in the lexicon. Can anyone help?

Sample data, as the output of `dput(head(formal_lexicon))` and `dput(head(tweets_data))` respectively:

structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")

c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

Could you add a usable (even fake) example of your data and lexicon? – s__, Jul 28 '21 at 14:20

score 1 · Answer 1 · s__ · 2021-07-28T15:12:26.810

You can try something like this:

library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

# put your tweets in a data.frame, with an id to group by later
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))

tweets_data_df %>%
  # split each tweet into one lower-cased word per row
  unnest_tokens(txt, Content) %>%
  # flag each word: 1 if it is in the lexicon, 0 otherwise
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # one group per tweet
  group_by(id) %>%
  # count the flagged words per tweet
  summarise(cnt = sum(pres)) %>%
  # put back the texts (joins on the shared "id" column)
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

With result:

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn, it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0
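
Since none of the sample tweets contain a lexicon word, the zeros above do not show much. As a cross-check, here is a minimal base R sketch of the same counting idea without tidytext. It assumes that lower-casing and splitting on anything that is not a letter, digit or apostrophe is a good enough tokeniser for your tweets (it will not match `unnest_tokens()` exactly), and `count_lexicon_words` is a hypothetical helper name, not something from a package. The example tweets are made up so that some counts are non-zero.

```r
# Hypothetical helper: for each text, count the tokens that appear in the lexicon.
count_lexicon_words <- function(texts, lexicon_words) {
  lexicon_words <- tolower(lexicon_words)
  # lower-case, then split on runs of characters that are not letters, digits or apostrophes
  tokens <- strsplit(tolower(texts), "[^[:alnum:]']+")
  # one integer per text; an empty text gives 0
  vapply(tokens, function(tok) sum(tok %in% lexicon_words), integer(1))
}

# Made-up data for illustration only
formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "furthermore"),
  stringsAsFactors = FALSE
)
tweets_data <- c(
  "Admittedly, this is hard. Furthermore, it is slow.",
  "happy holidays!",
  ""
)

tweets_data_df <- data.frame(Content = tweets_data, stringsAsFactors = FALSE)
tweets_data_df$Lexicon <- count_lexicon_words(tweets_data_df$Content, formal_lexicon$X)
tweets_data_df
```

Note that repeated words are counted each time they occur, and that the lexicon is passed as the column `formal_lexicon$X`, not as the whole data frame.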

edited Jul 28 '21 at 15:12

answered Jul 28 '21 at 14:27

s__


I get this error: `Error in UseMethod("pull") : no applicable method for 'pull' applied to an object of class "character"` – Ja123 Jul 28 '21 at 14:34
Is it a problem with my code and my data, or with my code and other data? – s__ Jul 28 '21 at 14:37
With your code and other data. But if I run your code with your data I also get an error: Error: `by` must be supplied when `x` and `y` have no common variables. ℹ use `by = character()` to perform a cross-join. Run `rlang::last_error()` to see where the error occurred. – Ja123 Jul 28 '21 at 14:39
For my code and data, I have corrected the last line: `select(...)`. For your data, you need to edit the question and share some of the data that produce the error. The output of `dput(head(formal_lexicon))` and `dput(head(tweets_data))` would be fine: post those results, because if the data are the problem, I can only help you make it work by looking at them. – s__ Jul 28 '21 at 14:44
dput(head(formal_lexicon)): `structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")` – Ja123 Jul 28 '21 at 15:04
dput(head(tweets_data)): `c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc @santa", "When my whole fam tryna have a peaceful holiday " )` – Ja123 Jul 28 '21 at 15:07
@Ja123 I see the problem: `tweets_data` is a character vector, so you need to convert it into a `data.frame` before passing it to the code. See the edit. None of the words in the sample tweets are in the lexicon, which is why all the results are 0. – s__ Jul 28 '21 at 15:13
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235391/discussion-between-ja123-and-s). – Ja123 Jul 28 '21 at 15:18

score 0 · Answer 2 · answered Jul 28 '21 at 14:45

Hope this is useful for you:

library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response helps you to do your task',
    'If it is the case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)


# Counting words in tweets present in your lexicon
in_lexicon <- tweets %>%
  # Split each tweet into one word per row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # Flag whether each word is in your lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Joining the counts back to the original data
dplyr::left_join(tweets, in_lexicon, by = "id")
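
One thing to watch with the join above: a tweet that yields no tokens at all (an empty string, for example) would, as far as I can tell, have no row in the summarised counts, so it comes back from `left_join()` with `NA` and not 0. The sketch below shows one way to handle that with `dplyr::coalesce()`. To keep it runnable with dplyr alone, the `in_lexicon` table here is a hand-written stand-in for the summarised counts, not real output.

```r
library(dplyr)

# Toy data: tweet 2 is empty, so it is assumed to have no row in the counts
tweets <- data.frame(
  id = 1:3,
  text = c('Hello, this is the first tweet', '', 'please upvote'),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the summarised counts (note: no row for id 2)
in_lexicon <- data.frame(
  id = c(1L, 3L),
  words_in_lexicon = c(2L, 1L)
)

result <- tweets %>%
  # name the key explicitly so the join does not depend on guessed columns
  left_join(in_lexicon, by = "id") %>%
  # tweets without a matching row come back as NA; turn those into 0
  mutate(words_in_lexicon = coalesce(words_in_lexicon, 0L))

result
```

Because `tweets` is on the left-hand side of the join, every tweet is kept in the result, including those with no lexicon words.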

I get this error (I took 50 observations from the big dataset to check if it works): `Error: Must extract column with a single valid subscript. x Subscript `var` has size 50 but must be size 1.` — Ja123, Jul 28 '21 at 15:08
Does that error occur when running my reprex or with your actual data? – Johan Rosa, Jul 28 '21 at 15:20
