I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built my own lexicon (formal_lexicon) of around 1 million words, all of which are formal in style. I want to write a short piece of code that counts, for each tweet, how many of its words (if any) also appear in that lexicon.
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
The output should thus look like:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
I assume it should be a simple for loop, something like:
for(sentence in tweets_data$Content){
  for(word in sentence){
    if(word %in% formal_lexicon){
      ...
    }
  }
}
I don't think the inner loop over "word" works, and I'm not sure how to write the count for each tweet into a new column. Can anyone help?
Sample of formal_lexicon (dput output):
structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
Sample of tweets_data$Content (dput output):
c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc @santa", "When my whole fam tryna have a peaceful holiday " )
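For reference, here is a minimal sketch of the kind of per-tweet count described above. It is not necessarily the best approach. It assumes that matching should be case-insensitive and that a "word" is any run of letters or apostrophes; both are assumptions, not requirements stated in the question. The helper name count_lexicon_words is hypothetical, and two of the three example tweets are made up so that the counts are non-zero.

```r
# Lexicon sample as in the question's dput output.
formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "conversely",
        "considerably", "essentially", "furthermore"),
  stringsAsFactors = FALSE
)

# First tweet is from the question's sample; the other two are hypothetical.
tweets_data <- data.frame(
  Content = c("2017 resolution: to embody authenticity!",
              "Admittedly late, and consequently rushed.",
              "Furthermore, furthermore, essentially done"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts, per text, how many tokens occur in the lexicon.
# Assumptions: case-insensitive matching; tokens are runs of letters/apostrophes.
count_lexicon_words <- function(text, lexicon) {
  lexicon <- tolower(lexicon)
  # Split every text into lowercase tokens on anything that is not a letter or apostrophe.
  tokens <- strsplit(tolower(text), "[^[:alpha:]']+")
  # Look up all tokens in one %in% call instead of once per tweet.
  hit <- unlist(tokens) %in% lexicon
  # Remember which tweet each token came from, then count the hits per tweet.
  tweet_id <- rep(seq_along(tokens), lengths(tokens))
  tabulate(tweet_id[hit], nbins = length(tokens))
}

tweets_data$Lexicon <- count_lexicon_words(tweets_data$Content, formal_lexicon$X)
print(tweets_data)
```

With this sample, the Lexicon column comes out as 0, 2 and 3. Note that a lexicon word repeated within a tweet is counted each time it occurs; if each distinct word should count only once per tweet, the tokens would need to be de-duplicated per tweet first.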