Lexicon list download
- The first step is to manually download (a simple copy and paste) the lexicon list from this link and save it in .csv format:
http://www.saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt
Then you need to break this list down into 4 separate parts, each containing one affect. This will result in 4 .csv files:
anger_list = w.csv
fear_list = x.csv
joy_list = y.csv
sad_list = z.csv
If you do not want to do this manually, there is an alternative lexicon list where the data is directly downloadable as separate files: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
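Once you have the four files, a minimal sketch for loading them into R (assuming each .csv holds one lexicon term per row in its first column, with no header, and the file names given above):
# Load the four affect lexicons; file names and layout as assumed above
anger_list <- read.csv("w.csv", header = FALSE, stringsAsFactors = FALSE)
fear_list  <- read.csv("x.csv", header = FALSE, stringsAsFactors = FALSE)
joy_list   <- read.csv("y.csv", header = FALSE, stringsAsFactors = FALSE)
sad_list   <- read.csv("z.csv", header = FALSE, stringsAsFactors = FALSE)
The scoring loop further down indexes these as anger_list[,1] and so on, so keeping the term in the first column is all that matters.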
Text data download
- The other link you shared (http://webrobots.io/Kickstarter-datasets/) now seems to have both JSON and CSV files, and reading them into R is quite straightforward.
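For example, for the CSV dump a single call should do (the file name here is hypothetical; use whatever the downloaded file is called):
df <- read.csv("Kickstarter_2019-02-14.csv", stringsAsFactors = FALSE)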
Cleaning of URLs for text extraction
- I am not sure which column/field you are interested in analysing, as the data set I downloaded as of February 2019 does not have a field 'p'.
Since you mentioned the presence of URLs, I am also sharing a short function for editing or cleaning them; this will help you get clean textual data out of text containing URLs:
library(tm)  # provides the stopwords() function used below

replacePunctuation <- function(x)
{
  # Lowercase all words for convenience
  x <- tolower(x)
  # Remove words containing 3 or more consecutive digits
  x <- gsub("[a-zA-Z]*([0-9]{3,})[a-zA-Z0-9]* ?", " ", x)
  # Remove extra punctuation
  x <- gsub("[.]+[ ]", " ", x)  # full stop
  x <- gsub("[:]+[ ]", " ", x)  # colon
  x <- gsub("[?]", " ", x)      # question mark
  x <- gsub("[!]", " ", x)      # exclamation mark
  x <- gsub("[;]", " ", x)      # semicolon
  x <- gsub("[,]", " ", x)      # comma
  x <- gsub("[']", " ", x)      # apostrophe
  x <- gsub("[-]", " ", x)      # hyphen
  x <- gsub("[#]", " ", x)      # hash
  # Remove all newline characters
  x <- gsub("[\r\n]", " ", x)
  # Regex pattern for removing English stop words
  stop_pattern <- paste0("\\b(", paste0(stopwords("en"), collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, " ", x)
  # Collapse runs of whitespace into a single space
  x <- gsub(" {2,}", " ", x)
  x
}
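All the operations inside the function are vectorised, so you can clean an entire text column in one call:
# Clean the whole text column at once
df$p <- replacePunctuation(df$p)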
Code for adding scores on sentiment or affect
Next, I assume you have read your data as text in R. Let's say it is stored in a column p of some data frame df. The next step would be to add additional columns to this data frame:
df$p # contains text of interest
Now add a column to this data frame for each of the four affects:
df$ANGER = 0
df$FEAR = 0
df$JOY = 0
df$SADNESS = 0
Then you simply loop through each row of df, splitting the text p into words on white space. You then look for occurrences of terms from your lexicon lists among those words and assign scores for each affect as below:
for (i in 1:nrow(df))
{
  # counter initialization
  angry <- 0
  feared <- 0
  joyful <- 0
  sad <- 0
  # split the text 'p' into words on white space
  words <- strsplit(as.character(df$p[i]), " ")[[1]]
  for (j in 1:length(words))
  {
    if (words[j] %in% anger_list[,1])
      angry <- angry + 1
    else if (words[j] %in% fear_list[,1])
      feared <- feared + 1
    else if (words[j] %in% joy_list[,1])
      joyful <- joyful + 1
    else if (words[j] %in% sad_list[,1])
      sad <- sad + 1  # only count words actually present in the sadness list
  }
  df$ANGER[i] <- angry
  df$FEAR[i] <- feared
  df$JOY[i] <- joyful
  df$SADNESS[i] <- sad
}
Please note that in the above implementation I'm assuming a word can represent only one affect at a time; that is, I assume these affects are mutually exclusive. However, for some of the terms in your text 'p' this might not be true, so you should modify the code to allow multiple affects per term.
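For instance, replacing the if/else chain in the inner loop with independent checks would let a single word count towards several affects at once (a sketch using the same list objects as above):
for (j in 1:length(words))
{
  # Independent checks: a word may contribute to more than one affect
  if (words[j] %in% anger_list[,1]) angry <- angry + 1
  if (words[j] %in% fear_list[,1]) feared <- feared + 1
  if (words[j] %in% joy_list[,1]) joyful <- joyful + 1
  if (words[j] %in% sad_list[,1]) sad <- sad + 1
}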