
I'm not sure if I'm asking this question in the right place as I'm new to Stack Overflow; please move it if required.

I'm trying to solve a link prediction problem on a Flickr dataset. The dataset has 5K nodes, each with around 27K features, and it is sparse.

I want to compute the similarity between nodes so that I can predict a link between two nodes whenever the similarity value is greater than some threshold I choose. The problem is the number of features: I cannot even load the file in Weka (where I wanted to reduce the features with information gain or similar, and then try clustering or a cosine similarity measure).
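For scale, a 5K-by-27K sparse matrix can be handled directly with SciPy/scikit-learn instead of Weka. A minimal sketch, assuming the features are stored in SciPy's sparse `.npz` format (the file name and the number of SVD components are illustrative):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Load the sparse node-by-feature matrix (about 5000 x 27000, mostly zeros).
X = sparse.load_npz("flickr_features.npz")

# Optional: reduce the 27K sparse features to a few hundred dense components
# before clustering or similarity search (an alternative to info-gain selection).
X_reduced = TruncatedSVD(n_components=200, random_state=0).fit_transform(X)

# All-pairs cosine similarity; a 5000 x 5000 float matrix fits comfortably in memory.
sim = cosine_similarity(X)
```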

One more problem is how to define this as a classification problem. I wanted to find overlapping tags for two nodes, so each row of the table would contain a pair of nodes and some of their features (in the thousands), and every row would belong to the positive class only, because I already know there is a link between those nodes.

I want to create a test dataset with some of the nodes, build a similar table, and label each row as positive or negative. But all the data I have is positive, so I think the classifier would never label anything as negative. How do I turn this into a classification problem correctly?
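One common way to get both classes is to keep the known links as positives and to sample node pairs with no known link as negatives. A minimal sketch, assuming `edges` is the list of known links and `n_nodes` the number of nodes (both names are illustrative):

```python
import random

def build_pair_dataset(edges, n_nodes, neg_per_pos=1, seed=0):
    """Known links get label 1; randomly sampled non-links get label 0."""
    rng = random.Random(seed)
    positive = {tuple(e) for e in edges}
    pairs = list(positive)
    labels = [1] * len(pairs)
    needed = neg_per_pos * len(pairs)
    while needed > 0:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and (u, v) not in positive and (v, u) not in positive:
            pairs.append((u, v))
            labels.append(0)
            needed -= 1
    return pairs, labels
```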

Any pointers or help is very much appreciated.

TechCrunch

1 Answer


Weka can deal with 27K features; it shouldn't be a problem. However, I would not approach this as a classification problem but as a link-discovery one, which in this case can be seen as a matching problem.

My approach would be:

1. A new node appears.
2. Search for the most similar existing nodes.
3. Assume they are related (there is a link) if the similarity is greater than your threshold.
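A minimal sketch of those three steps, assuming `X` is the sparse feature matrix of the existing nodes and `x_new` is the feature row of the new node (the names and the threshold value are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def predict_links(x_new, X, threshold=0.3):
    sims = cosine_similarity(x_new, X).ravel()   # similarity to every existing node
    return np.flatnonzero(sims >= threshold)     # indices of nodes we predict a link to
```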

The main problem would be to tune the threshold based on some quality measure.
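One possible way to do that tuning is to sweep candidate thresholds and keep the one with the best F1 score on a held-out set of labelled pairs, where `scores` are the similarity values and `y_true` marks which pairs are real links (the candidate grid is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, y_true, candidates=np.linspace(0.05, 0.95, 19)):
    f1s = [f1_score(y_true, (scores >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(f1s))]
```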

For this approach, Lucene would probably be the best option.

I hope this helps.

miguelmalvarez
  • Thank you so much for the response. I could not load the file itself in Weka. The values of all the attributes are numbers, there are 5K records, and the file size is around 250MB. I'm not sure if I'm missing something. Is the approach you mentioned similar to mine? I couldn't work out how to define this as a classification problem, i.e. what would the training dataset contain? Each row would have a pair of nodes (4K x 4K) and their features (or reduced features) and would be classified Yes or No? In the training set, all rows are classified as Yes, since I already know the links. So, will it ever classify No? – TechCrunch Apr 22 '13 at 14:58
  • About Weka, could you provide more information about the error you get (if you get any)? Is there any reason why you have to address this task as a classification problem? I don't think it is a good fit for it, and you have better alternatives. – miguelmalvarez Apr 22 '13 at 15:47
  • There is no specific reason why I want to do it as classification. I read a paper on link prediction with supervised learning, and they treat it as a classification problem. It is similar to this, but I couldn't work out how to relate the two. Regarding Weka, there is no error as such, but the file never loads; it just keeps saying "Reading file". Maybe because the file is too large? So, if we manage to load it in Weka and don't do classification, can we use something in Weka to predict a link? I can't see how to use Weka for this purpose. Can we get the similarity of all nodes? – TechCrunch Apr 22 '13 at 17:59
  • Can you provide a link to the paper? The way you could apply it as a classification problem is to see each document as a "class". Then you decide whether to classify (link) any new document to each one of them. However, this would be very inefficient in your case. Weka (the API) can be used to compute cosine similarities, but it is not the best option. You should split your question in two: 1. how to solve the task, for which I would use matching and similarity with Lucene, although you can try the classification route as I said before; 2. the problem with Weka, in the Weka section of SO. – miguelmalvarez Apr 22 '13 at 18:42
  • Thank you so much for helping me on this. Link is [here](http://www.siam.org/meetings/sdm06/workproceed/Link%20Analysis/12.pdf) – TechCrunch Apr 22 '13 at 19:29
  • By any chance, did you figure out how they were treating this as a classification problem? Thank you so much for the help. – TechCrunch Apr 23 '13 at 14:58
  • The paper does not classify documents, but links. It starts with a training set of documents and links. In your case this could mean up to 5K*5K links (if every author co-authored with everyone else); how many links do you have in training? Each link/point (doc1, doc2) is defined as a set of features such as: cosine similarity, number of common keywords, number of authors, co-citation counts between them, etc., say around 50 features. Then that is fed to the classifier. This means: 1. you have to compute (somehow) the feature weights, 2. represent every training link that way, 3. feed it to Weka, 4. classify. – miguelmalvarez Apr 23 '13 at 15:21
  • I have a total of 400K known edges right now. I plan to hold out 10% of those edges and try to predict them, so the other 90% will be training. So the features would be the number of common tags, common labels, etc.? My concern is that with this setup I would mark every row in the training set as the positive class, so how would it ever classify anything as No? Are you saying that I also have to put some links in the training set that are labelled as the negative class? This is my main concern about using this in Weka. – TechCrunch Apr 23 '13 at 21:01
  • 1. Yes, you should have examples of NO links as well 2. 360,000 (your train set) sounds like way too much for Weka, but I'm not sure... Best of luck :) – miguelmalvarez Apr 24 '13 at 10:41
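Putting the recipe from these comments together, a rough sketch of the supervised variant: represent each candidate pair by a few similarity features, label the known edges 1 and the sampled non-edges 0, and train a standard classifier. The two features and the classifier choice below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(u, v, X):
    """Feature vector for one candidate link (u, v) given sparse node features X."""
    cos = cosine_similarity(X[u], X[v])[0, 0]
    common = X[u].minimum(X[v]).sum()   # rough count of shared tags/keywords
    return [cos, common]

def train_link_classifier(pairs, labels, X):
    feats = np.array([pair_features(u, v, X) for u, v in pairs])
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```

The classifier's predicted probability for an unseen pair then plays the same role as the similarity threshold in the matching approach described in the answer.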