
I am attempting to create a model to measure emotion in text using R. Using a lexicon of emotion words, I want to extract only the 'p' (paragraph) part from a large number of URLs, and then find the word count per emotion per URL, based on the presence of pre-defined emotion-indicating words from the lexicon. Lexicon link

The data I use is in JSON format, from Webrobots: Dataset Link (the latest set).

Any help would be much appreciated, as I am really desperate to get started on this! Even just knowing how I could import this into R, and some code to count the words, would be of great help.

Kind regards, a desperate R-illiterate girl.

Update: the data file is now imported into R. However, I cannot find a way to write code that tests the data for the presence of the lexicon-indicated words. I want to create 6 new variables, one per basic emotion (happy, sad, anger, surprise, fear, disgust), holding each campaign's word count for that emotion.

On closer inspection, the file I imported already contains the paragraph ('p') part. I just need to categorize its contents.

Rhino
    Welcome to SO, please be a bit more specific when asking a question: what have you tried, what do you expect, etc. See [how to ask](http://stackoverflow.com/help/how-to-ask) – Nehal Mar 16 '16 at 13:01

1 Answer


Lexicon list download

  1. The first step is to manually download (a simple copy and paste) the lexicon list from this link and save it in .csv format:

http://www.saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt

Then you need to break this list down into 4 separate parts, each containing one affect. This will result in 4 .csv files (a sketch for doing this in R follows the list):

anger_list = w.csv
fear_list  = x.csv
joy_list   = y.csv
sad_list   = z.csv
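
If you would rather let R do the splitting, here is a minimal sketch. It assumes you have saved the copied rows as a tab-separated file lexicon.txt (a hypothetical name) with one term, score and affect per line; check the actual layout of your copy before running:

lex <- read.table("lexicon.txt", sep = "\t", stringsAsFactors = FALSE,
                  col.names = c("term", "score", "affect"))

# One data frame per affect (labels assumed to be anger/fear/joy/sadness)
anger_list <- lex[lex$affect == "anger",   ]
fear_list  <- lex[lex$affect == "fear",    ]
joy_list   <- lex[lex$affect == "joy",     ]
sad_list   <- lex[lex$affect == "sadness", ]

write.csv(anger_list, "w.csv", row.names = FALSE)
write.csv(fear_list,  "x.csv", row.names = FALSE)
write.csv(joy_list,   "y.csv", row.names = FALSE)
write.csv(sad_list,   "z.csv", row.names = FALSE)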

If you do not want to do this manually, there is an alternative lexicon list where the data is directly downloadable as separate files: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Text data download

  1. The other link you shared (http://webrobots.io/Kickstarter-datasets/) now seems to have both JSON and csv files, and reading either into R is quite straightforward.
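
For the JSON dump, the jsonlite package works well. This is only a sketch: it assumes the download is newline-delimited JSON (one record per line), which is how the Webrobots dumps are usually distributed, and Kickstarter.json is a hypothetical file name:

library(jsonlite)

# One JSON record per line -> stream_in returns a flattened data frame
kick <- stream_in(file("Kickstarter.json"))

# The .csv variant is simpler still:
# kick <- read.csv("Kickstarter.csv", stringsAsFactors = FALSE)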

Cleaning of URLs for text extraction

  1. I am not sure which column/field you are interested in analysing, as the data set I downloaded as of February 2019 does not have a field 'p'.

Since you mentioned the presence of URLs, I am also sharing a brief code snippet for editing or cleaning URLs. This will help you get clean textual data out of them:

library(tm)  # provides stopwords("en") used below

replacePunctuation <- function(x)
{
  # Lowercase all words for convenience
  x <- tolower(x)

  # Remove words containing runs of 3 or more consecutive digits
  x <- gsub("[a-zA-Z]*([0-9]{3,})[a-zA-Z0-9]* ?", " ", x)

  # Remove extra punctuation
  x <- gsub("[.]+[ ]", " ", x) # full stop
  x <- gsub("[:]+[ ]", " ", x) # colon
  x <- gsub("[?]", " ", x)     # question mark
  x <- gsub("[!]", " ", x)     # exclamation mark
  x <- gsub("[;]", " ", x)     # semicolon
  x <- gsub("[,]", " ", x)     # comma
  x <- gsub("[']", " ", x)     # apostrophe
  x <- gsub("[-]", " ", x)     # hyphen
  x <- gsub("[#]", " ", x)     # hash

  # Remove all newline characters
  x <- gsub("[\r\n]", " ", x)

  # Regex pattern for removing stop words
  stop_pattern <- paste0("\\b(", paste0(stopwords("en"), collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, " ", x)

  # Replace runs of 2 or more spaces with a single space
  x <- gsub(" {2,}", " ", x)

  x
}
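
Once defined, the cleaner can be applied to the whole text column in one call (df$p is the column name assumed in the next section):

df$p <- replacePunctuation(df$p)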

Code for adding scores on sentiment or affect

  1. Next, I assume you have read your data as text into R and stored it in a column df$p of some data frame. The next step is to add additional columns to this data frame:

    df$p # contains text of interest
    

Now add additional columns to this data frame, one for each of the four affects:

df$ANGER   = 0
df$FEAR    = 0
df$JOY     = 0
df$SADNESS = 0

Then you simply loop through each row of df, break the text p into words on whitespace, look for occurrences of the terms from your lexicon lists among those words, and increment the count for each affect, as below:

for (i in 1:nrow(df))
{
  # counter initialization
  angry  = 0
  feared = 0
  joyful = 0
  sad    = 0

  # assume the text 'p' is in the first column of df
  words <- strsplit(df[i, 1], " ")[[1]]

  for (j in 1:length(words))
  {
    if (words[j] %in% anger_list[, 1])
      angry = angry + 1
    else if (words[j] %in% fear_list[, 1])
      feared = feared + 1
    else if (words[j] %in% joy_list[, 1])
      joyful = joyful + 1
    else if (words[j] %in% sad_list[, 1])
      sad = sad + 1
  }

  df$ANGER[i]   <- angry
  df$FEAR[i]    <- feared
  df$JOY[i]     <- joyful
  df$SADNESS[i] <- sad
}

Please note that in the above implementation I assume a word can represent only one affect at a time, i.e. the affects are mutually exclusive (a word that matches the anger list is not also checked against the others). For some of the terms in your text 'p' this may not be true, so you may want to modify the code to allow multiple affects per term.
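
If you do want a single word to be able to score several affects at once, a vectorized variant drops the mutual exclusivity and also avoids the explicit double loop. This is only a sketch, reusing the anger_list/fear_list/joy_list/sad_list data frames and the df$p column assumed above:

count_hits <- function(texts, terms) {
  # For each text, split on whitespace and count words found in 'terms'
  vapply(strsplit(texts, " "),
         function(w) sum(w %in% terms),
         integer(1))
}

df$ANGER   <- count_hits(df$p, anger_list[, 1])
df$FEAR    <- count_hits(df$p, fear_list[, 1])
df$JOY     <- count_hits(df$p, joy_list[, 1])
df$SADNESS <- count_hits(df$p, sad_list[, 1])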

Sandy