Snapshot of the dataset:
I'm getting the following chart:
Here is the code:
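library(dplyr)     # %>%, count(), joins, rename(), top_n()
library(ggplot2)   # ggplot(), geom_col(), facet_wrap()
library(tidytext)  # unnest_tokens(), get_sentiments()
library(syuzhet)

# One row per word per song
lyrics$lyric <- as.character(lyrics$lyric)
tidy_lyrics <- lyrics %>%
  unnest_tokens(word, lyric)

# Total number of words in each song
song_wrd_count <- tidy_lyrics %>% count(track_title)
lyric_counts <- tidy_lyrics %>%
  left_join(song_wrd_count, by = "track_title") %>%
  rename(total_words = n)

# Attach NRC sentiment labels to each word
lyric_sentiment <- tidy_lyrics %>%
  inner_join(get_sentiments("nrc"), by = "word")

# Top 10 words per sentiment, one facet per sentiment
lyric_sentiment %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(n = 10, wt = n) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip()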
I'm not sure whether the result I'm getting is correct. For instance, 'bad' shows up under multiple emotions in the chart. Also, if we inspect lyric_sentiment, the word 'shame' appears four times for the song 'Tim McGraw', although it actually occurs only twice in that song.
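To illustrate what I suspect is going on, here is a minimal sketch with made-up data. Both `toy_lyrics` and `toy_lexicon` are hypothetical stand-ins: they are not my real dataset and not the actual NRC entries. The assumption is simply that the lexicon can list the same word under more than one sentiment, in which case the join would return one row per occurrence per sentiment.

```r
library(dplyr)

# Hypothetical stand-in for tidy_lyrics: 'shame' occurs twice in one song
toy_lyrics <- tibble(
  track_title = "Tim McGraw",
  word = c("shame", "shame")
)

# Hypothetical lexicon (NOT the real NRC rows): one word, two sentiments
toy_lexicon <- tibble(
  word = c("shame", "shame"),
  sentiment = c("negative", "sadness")
)

# The join yields one row per (occurrence, sentiment) pair;
# newer dplyr versions may warn about a many-to-many relationship here
joined <- inner_join(toy_lyrics, toy_lexicon, by = "word")
nrow(joined)  # 2 occurrences x 2 sentiments = 4 rows

# Number of sentiment labels attached to each word in the lexicon
label_counts <- toy_lexicon %>% count(word, name = "n_sentiments")
label_counts
```

If that is what happens with the real lexicon too, it would explain both the doubled count for 'shame' and 'bad' showing up in several facets.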
What's the right approach?