I am struggling to choose the best approach for a classification/prediction problem. Let me explain the task: I have a database of keywords extracted from the abstracts of different research papers, and I also have a list of journals with their impact factors. I want to build a model that, given an article's keywords, predicts its impact factor (taken just as a number, without any further journal description).

I removed keywords that occur only once, since they carry little statistical significance, so I keep only keywords that appear two or more times in my abstract list (6000 keywords in total). My idea is dummy coding: for each article, I create a binary feature vector of length 6000, where each attribute indicates the presence of a keyword in the abstract, and then I classify the whole set with an SVM.

I am fairly sure this solution is not very elegant and may not even be correct. Do you have any suggestions for a better approach?
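For concreteness, a minimal sketch of the dummy-coding idea described above, using scikit-learn; the keyword lists and class labels here are hypothetical placeholders standing in for the real data:

```python
# Binary keyword presence vectors + SVM, as described in the question.
# All data below is made up for illustration.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

abstract_keywords = [
    ["svm", "classification", "text"],
    ["regression", "text"],
    ["svm", "regression"],
]
impact_factor_class = ["high", "low", "high"]  # discretized target for SVC

mlb = MultiLabelBinarizer()                # one binary column per keyword
X = mlb.fit_transform(abstract_keywords)   # shape: (n_articles, n_keywords)

clf = SVC(kernel="linear")
clf.fit(X, impact_factor_class)
print(X.shape)  # (3, 4) here; (n_articles, ~6000) in the real data
```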
There is nothing wrong with using this coding strategy for text and support vector machines.
For your actual objective:
- support vector regression (SVR) may be more appropriate, since the impact factor is a continuous target rather than a class label
- beware of the journal impact factor: it is very crude. You need to take temporal aspects into account, and much very good work is not published in journals at all
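A minimal sketch of the SVR alternative suggested above: keep the same binary keyword features, but regress directly on the continuous impact factor instead of classifying. The data is again a hypothetical placeholder:

```python
# SVR on binary keyword-presence features; all values here are made up.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVR

abstract_keywords = [
    ["svm", "classification", "text"],
    ["regression", "text"],
    ["svm", "regression"],
    ["classification", "text"],
]
impact_factors = np.array([3.2, 1.1, 2.8, 1.5])  # hypothetical targets

X = MultiLabelBinarizer().fit_transform(abstract_keywords)

reg = SVR(kernel="linear", C=1.0, epsilon=0.1)
reg.fit(X, impact_factors)
pred = reg.predict(X)  # one predicted impact factor per article
```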

Has QUIT--Anony-Mousse