
I want to classify data into two classes based on given parameters. My data consists of publications from two different sources, and I want to classify the records as "match" or "non-match" when comparing dataset1 with dataset2. The datasets are unlabeled text data with five attributes (id, title, authors, venue, year), so if I apply unsupervised algorithms, they will not produce my target classes. On the other hand, supervised algorithms need labeled data, which is unavailable and time-consuming to create.

  • What is the best and easiest method to do that in Python?
desertnaut
Ocean

1 Answer


The easiest and, AFAIK, the best method is as follows:

  1. Use a clustering algorithm such as K-Means to group your data points into 2 clusters.
  2. Manually examine a few samples from one of the clusters and label that cluster accordingly.

Assume you randomly picked 10 data points from the first cluster and they all fall in the match class. Then all you need to do is label every data point in this cluster as match and every data point in the other cluster as non-match.

This would give you the required classification.
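The two steps above can be sketched roughly as follows. This is only a minimal, dependency-free illustration and not a definitive implementation: the toy records, the `pair_features` helper, the choice of features (title similarity and year equality), and the hand-written `two_means` routine are all assumptions made for the example. In practice you would more likely use a library implementation such as `sklearn.cluster.KMeans` and richer features over authors and venue.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy records; the real datasets have (id, title, authors, venue, year).
dataset1 = [
    {"id": "a1", "title": "Efficient Query Processing in Databases", "year": 2001},
    {"id": "a2", "title": "Learning to Rank for Web Search", "year": 2005},
    {"id": "a3", "title": "A Survey of Graph Clustering Methods", "year": 2010},
]
dataset2 = [
    {"id": "b1", "title": "Efficient query processing in databases.", "year": 2001},
    {"id": "b2", "title": "Learning to rank for web search", "year": 2005},
    {"id": "b3", "title": "Deep Networks for Image Recognition", "year": 2015},
]


def pair_features(r1, r2):
    """Turn one candidate pair into a numeric feature vector (assumed features)."""
    title_sim = SequenceMatcher(None, r1["title"].lower(), r2["title"].lower()).ratio()
    same_year = 1.0 if r1["year"] == r2["year"] else 0.0
    return (title_sim, same_year)


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iterations=20):
    """Tiny K-Means with k=2; returns one cluster label (0 or 1) per point."""
    # Deterministic start: the least and the most similar-looking pairs.
    centroids = [min(points, key=sum), max(points, key=sum)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min((0, 1), key=lambda k: sq_dist(p, centroids[k])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Step 1: cluster every cross-dataset pair into 2 clusters.
pairs = list(product(dataset1, dataset2))
features = [pair_features(r1, r2) for r1, r2 in pairs]
labels = two_means(features)

# Step 2: show a few pairs from each cluster for manual inspection.
for k in (0, 1):
    sample = [(r1["title"], r2["title"]) for (r1, r2), lab in zip(pairs, labels) if lab == k][:3]
    print(f"cluster {k}:", sample)

# Assumption: after looking at the samples, a human decided cluster 1 is "match".
MATCH_CLUSTER = 1
matches = [(r1["id"], r2["id"]) for (r1, r2), lab in zip(pairs, labels) if lab == MATCH_CLUSTER]
```

Note that comparing every record with every other record grows quadratically, so for large datasets you would first restrict the candidate pairs (for example, to records sharing a year).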

paradocslover
  • Would I use this method even though I have the `ground truth` data? – Ocean Dec 10 '20 at 02:37
  • If you have the ground truth data, then I don't see why you can't run a classifier. So no, you wouldn't use this method. Instead, you would want to use classification algorithms. – paradocslover Dec 10 '20 at 02:58
  • Thanks for your answers. I'm new to ML, so are there any resources or topics I can learn from about using `ground truth` data to run a classifier? – Ocean Dec 10 '20 at 06:37
  • Consider upvoting and accepting the answer... And this might give you an overview of classification - https://machinelearningmastery.com/types-of-classification-in-machine-learning/ – paradocslover Dec 11 '20 at 02:48
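
For the ground-truth case raised in the comments, a supervised classifier can be fit directly on labeled pairs. The sketch below is only an illustration under stated assumptions: the labeled pairs are made up, `title_similarity`, `fit_threshold`, and `predict` are hypothetical helpers, and the "classifier" is a one-feature decision stump (a learned similarity threshold) chosen to keep the example dependency-free. A library classifier over more features would be the more usual choice.

```python
from difflib import SequenceMatcher

# Hypothetical ground truth: (title from source 1, title from source 2, label),
# where label 1 means "match" and 0 means "non-match".
labeled_pairs = [
    ("Efficient Query Processing in Databases", "Efficient query processing in databases.", 1),
    ("Learning to Rank for Web Search", "Learning to rank for web search", 1),
    ("A Survey of Graph Clustering Methods", "Survey of graph clustering methods", 1),
    ("Efficient Query Processing in Databases", "Deep Networks for Image Recognition", 0),
    ("Learning to Rank for Web Search", "A Survey of Graph Clustering Methods", 0),
    ("A Survey of Graph Clustering Methods", "Deep Networks for Image Recognition", 0),
]


def title_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(pairs):
    """Pick the similarity cut-off that maximizes accuracy on the labeled pairs."""
    scored = [(title_similarity(a, b), y) for a, b, y in pairs]
    scores = sorted({s for s, _ in scored})
    # Candidate cut-offs: midpoints between consecutive distinct scores
    # (this assumes the data contains at least two distinct scores).
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]

    def accuracy(t):
        return sum((s >= t) == bool(y) for s, y in scored) / len(scored)

    return max(candidates, key=accuracy)


threshold = fit_threshold(labeled_pairs)


def predict(a, b):
    """Classify a new pair of titles with the learned threshold."""
    return "match" if title_similarity(a, b) >= threshold else "non-match"
```

With real data you would hold out part of the labeled pairs to check how well the learned threshold generalizes.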