I am struggling to choose the best approach for a classification/prediction problem. Let me explain the task: I have a database of keywords extracted from the abstracts of different research papers, and I also have a list of journals with their impact factors. I want to build a model that, given an article's keywords, predicts its impact factor (taken just as a number, without any further journal description).

I removed keywords that occur only once, since they carry little statistical significance, so I keep only keywords that appear two or more times in my abstract list (6000 keywords in total). My idea is dummy coding: for each article, I create a binary feature vector of length 6000, where each attribute indicates the presence of a keyword in the abstract, and then I classify the whole set with an SVM.

I am fairly sure this solution is not very elegant and may not even be correct. Do you have any suggestions for a better approach?
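For concreteness, a minimal sketch of the dummy-coding idea described above, using scikit-learn; the keyword lists and class labels here are hypothetical placeholders standing in for the real data:

```python
# Binary keyword presence vectors + SVM, as described in the question.
# All data below is made up for illustration.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

abstract_keywords = [
    ["svm", "classification", "text"],
    ["regression", "text"],
    ["svm", "regression"],
]
impact_factor_class = ["high", "low", "high"]  # discretized target for SVC

mlb = MultiLabelBinarizer()                # one binary column per keyword
X = mlb.fit_transform(abstract_keywords)   # shape: (n_articles, n_keywords)

clf = SVC(kernel="linear")
clf.fit(X, impact_factor_class)
print(X.shape)  # (3, 4) here; (n_articles, ~6000) in the real data
```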
There is nothing wrong with using this coding strategy for text and support vector machines.
For your actual objective:
- support vector regression (SVR) may be more appropriate, since the impact factor is a continuous target rather than a class label
- beware of the journal impact factor: it is very crude. You need to take temporal aspects into account, and much very good work is not published in journals at all
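A minimal sketch of the SVR alternative suggested above: keep the same binary keyword features, but regress directly on the continuous impact factor instead of classifying. The data is again a hypothetical placeholder:

```python
# SVR on binary keyword-presence features; all values here are made up.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVR

abstract_keywords = [
    ["svm", "classification", "text"],
    ["regression", "text"],
    ["svm", "regression"],
    ["classification", "text"],
]
impact_factors = np.array([3.2, 1.1, 2.8, 1.5])  # hypothetical targets

X = MultiLabelBinarizer().fit_transform(abstract_keywords)

reg = SVR(kernel="linear", C=1.0, epsilon=0.1)
reg.fit(X, impact_factors)
pred = reg.predict(X)  # one predicted impact factor per article
```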

Has QUIT--Anony-Mousse