I'm doing a research on the author name disambiguation problem. I want to make some experiments. I want to perform clustering on citation records. My dataset consist of 2000 xml records. I need testing data. The dataset that I'm using is not popular and I need to make testing data manually. I don't know how to do so. I need instruction of how to make testing data manually. Note: I want to compare the performance of a set of techniques in solving the author name disambiguation problem, So I must perform testing.
-
Clustering is an unsupervised learning task - which means you don't know the 'correct' outputs that your model should predict. So I guess the first question is: how do you exactly want to use your testing data? – dratewka Jul 14 '16 at 11:57
-
I've edited my question (I've added a note ). Please have a look at it. – s.e Jul 14 '16 at 12:04
2 Answers
Even though it is not really clear what kind of testing you want to perform, but general answer to the issue at hand - trying to artificially create more data from the data you have at hand - is a bootstrap. In general it is technique when you perform sampling with replacement from your dataset as many times as you want. It randomly picks up some element from your data repetitively untill you get a sample of the size you want. The sample you get could be larger than your original dataset but should have similar (from statistical point of view) as your original dataset. Bootstrap sampling is available in sklearn.
P.S. You need to keep in mind that this solution is not optimal - best solution to this problem is to actually get more real data somehow.

- 4,742
- 7
- 39
- 70
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have a features for each author / publication. Now you give the classifier two of those feature vectors. It classifies "it is the same author" or "those are different authors".
Training / testing data
Having a binary classification problem, the testing suddenly becomes simple: Just use one of the measures used in literature so often (accuracy, precision, recall, confuscation matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and authors have an identifier? Then you can simply generate negative examples by having different authors and positive examples by checking if the identifier is the same.
Otherwise you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications under the same author which should be different, they do distinguish authors not only by name but give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.

- 124,992
- 159
- 614
- 958