
I'm trying to measure the Jaccard distance between tweets in a dataset.

The dataset is here:

http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json

I've tried a few things to measure the distance. This is what I have so far.

I saved the linked dataset to a file called Tweets.json:

library(jsonlite)  # assumption: the post does not say which package provides fromJSON
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines("Tweets.json"), collapse = ",")))

Then I copied json_alldata to tweet.features and got rid of the geo column:

# get rid of geo column
tweet.features <- json_alldata
tweet.features$geo <- NULL

This is what the first two tweets look like:

> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

The first thing I tried was the stringdist function from the stringdist package:

install.packages("stringdist")
library(stringdist)

#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")

When I run that, I get:

[1] 0.1621622

I'm not sure that's correct, though. By my count, A intersection B = 23 and A union B = 25. The Jaccard distance is (A intersection B) / (A union B), right? So by my calculation, the Jaccard distance should be 0.92?

So I figured I could do it with sets: calculate the intersection and the union, then divide. This is what I tried:

# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
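library(sets)  # assumption: as.set is not in base R; the sets package provides it
# create a set for each of the first two tweets
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])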

When I try the intersection, the output is just list():

Intersection <- intersect(A1, A2)
list()

When I try the union, I get this:

union(A1, A2)

[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"

[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

This doesn't seem to be grouping the words into a single set.

I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number of words in each set, then do the calculation.

Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.

Any help would be appreciated. Thank you.

user3577397
  • `0.1621622` is the distance in terms of `1 - length( intersect(colnames(qgrams(tweet.features$text[1])),colnames(qgrams(tweet.features$text[2]))) ) / length( unique(c(colnames(qgrams(tweet.features$text[1])), colnames(qgrams(tweet.features$text[2])))) )`. You are free to build n-grams or tokenize your tweets. I don't know what you want in the end. – lukeA Apr 01 '16 at 19:36
  • Thank you. So I guess that function I ran is this long line of code? I'm trying to measure the number of words in each tweet, then calculate the Jaccard distance. It would probably be easier if I picked two very different tweets. The ones I picked are very similar to each other. – user3577397 Apr 01 '16 at 21:28
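
For reference, the 0.1621622 figure can be reproduced in base R under the reading given in the comment above: the comparison is made on sets of single characters (q-grams with q = 1), not on words. The sketch below is only an illustration of that reading; the variable names are made up here and it does not call stringdist itself.

```r
# The two tweets quoted in the question
tweet1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tweet2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Split each tweet into its distinct single characters (case-sensitive)
chars1 <- unique(strsplit(tweet1, "")[[1]])
chars2 <- unique(strsplit(tweet2, "")[[1]])

# Jaccard similarity on characters, then distance = 1 - similarity
shared   <- length(intersect(chars1, chars2))  # 31 characters in common
distinct <- length(union(chars1, chars2))      # 37 distinct characters overall
char_distance <- 1 - shared / distinct

print(char_distance)  # 6/37, about 0.1621622
```

Because the two tweets use almost the same alphabet, the character-level distance is small even though a word-level comparison would give a different number.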

1 Answer


intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit, but how you split is up to you. An example is below:

tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
                       tweet2=          "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")

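# Jaccard index (similarity) between two tweets, compared token by token
jaccard_i <- function(tw1, tw2){
  # split each tweet into tokens on spaces or dots
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))  # number of shared tokens
  u <- length(union(tw1, tw2))      # number of distinct tokens overall
  list(i=i, u=u, j=i/u)             # j is the Jaccard index; 1 - j is the distance
}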

jaccard_i(tweet.features[[1]], tweet.features[[2]])

$i
[1] 20

$u
[1] 23

$j
[1] 0.8695652

Is this what you want?

Here strsplit splits on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
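
One possible refinement, offered as a sketch rather than a definitive tokenizer: splitting on " |\\." leaves an empty string where "victims." is followed by a space, and that empty token is counted like a word. The version below lowercases the text, splits on whitespace, strips trailing punctuation, and drops empty tokens. It also returns the distance as 1 minus the index, since i/u on its own is the Jaccard similarity. The names word_tokens and jaccard_distance are hypothetical helpers introduced here.

```r
# Hypothetical helper: turn a tweet into a set of cleaned, lowercase words
word_tokens <- function(tweet) {
  tokens <- unlist(strsplit(tolower(tweet), "[[:space:]]+"))
  tokens <- gsub("[[:punct:]]+$", "", tokens)  # drop trailing ".", ":" and similar
  unique(tokens[nzchar(tokens)])               # remove empty strings and duplicates
}

# Hypothetical helper: Jaccard distance = 1 - |A and B| / |A or B|
jaccard_distance <- function(tw1, tw2) {
  a <- word_tokens(tw1)
  b <- word_tokens(tw2)
  1 - length(intersect(a, b)) / length(union(a, b))
}

tweet1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tweet2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

d <- jaccard_distance(tweet1, tweet2)
print(d)  # 20 shared words out of 22 distinct ones, so 2/22
```

Whether to lowercase or to keep the "@" and "#" prefixes is a judgment call that depends on what should count as the same word.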

Vincent Bonhomme
  • Thank you. This looks like it'll be really helpful. I'll test it out tonight. So strsplit is splitting on the spaces, but I guess that's the same thing as counting the number of words? – user3577397 Apr 01 '16 at 21:26
  • Well, it mostly depends on your question. I'm not an expert in text analysis, but there are some here. – Vincent Bonhomme Apr 01 '16 at 21:48
  • I tested it out. It works really well. What I'm trying to do is cluster tweets with the K-means algorithm based on their Jaccard Distance. So I guess I need to modify it and figure out how to test the Jaccard distance across 251 tweets. – user3577397 Apr 02 '16 at 20:06
  • Nice, and you can do it ;-) – Vincent Bonhomme Apr 03 '16 at 06:48
  • We'll see. I hope so :P – user3577397 Apr 03 '16 at 14:06
  • I decided to modify the algorithm you provided a bit, but I'm not sure if I should ask here or start a new question. I basically edited the `tweet.features <- list` line to select 10 random tweets from the dataset. But I think when I run the Jaccard distance, it's actually pooling all the tweets together instead of comparing two tweets. This is what I changed, leaving everything else the same: `tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5), sample(tweet.features$text, replace = FALSE, size = 5))` – user3577397 Apr 06 '16 at 21:03
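
On comparing many tweets: when each list element holds several tweets, unlist(strsplit(...)) pools their words into one vector, which would explain the behaviour described in the last comment. A minimal sketch of a pairwise alternative follows, assuming one tweet per element of a character vector. The three short strings are made-up stand-ins for the real data, and the names tweets, token_sets and dist_matrix are illustrative.

```r
# Made-up example tweets, one per element (stand-ins for tweet.features$text)
tweets <- c("the cat sat", "the cat ran", "dogs bark loudly")

# One set of words per tweet
token_sets <- lapply(tweets, function(tw) unique(unlist(strsplit(tw, " "))))

# Fill an n x n matrix with the Jaccard distance of every pair of tweets
n <- length(token_sets)
dist_matrix <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    shared   <- length(intersect(token_sets[[i]], token_sets[[j]]))
    distinct <- length(union(token_sets[[i]], token_sets[[j]]))
    dist_matrix[i, j] <- 1 - shared / distinct
  }
}

print(dist_matrix)

# Convert to a "dist" object for functions that accept precomputed distances
tweet_dist <- as.dist(dist_matrix)
```

Note that base R's kmeans() takes a matrix of coordinates rather than a distance matrix, so clustering on these distances would need either a method that accepts a dist object, such as hclust(), or a custom k-means loop.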