
I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).

The data are title & abstract (mean = 1,300 characters per article).

Any approaches may be used, or even combined, including supervised machine learning and/or establishing features that yield threshold values for inclusion, among others.

Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.

Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?

Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!

Ramon Wenzel
  • How do you define relevancy? Considering only 1% of the corpus for training is not reasonable. Do you have annotations for your corpus? I mean a relevant/irrelevant label for each document. – Wasi Ahmad Nov 24 '16 at 04:36

1 Answer


Several Ideas:

  1. Run LDA to get document-topic and topic-word distributions (say, 20 topics, depending on how many distinct topics your dataset covers). Label the top r% of documents with the highest weight on the relevant topic(s) as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents.

  2. Just use bag of words: retrieve the top r% nearest neighbours of your query (your conceptual space) as relevant and the bottom nr% as non-relevant, and train a classifier over them.

  3. If you had the citations, you could run label propagation over the citation network graph after labelling only a few papers.

  4. Don't forget to distinguish title words from abstract words, e.g. by rewriting each title word as title_word1, so that a classifier can put more weight on them.

  5. Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.

  6. If the number of relevant documents is far smaller than the number of non-relevant ones, the best way to go is to find the nearest neighbours of your conceptual space (e.g., using information retrieval as implemented in Lucene). Then you can manually go down the ranked results until you feel the documents are no longer relevant.
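Idea 1 can be sketched as follows with scikit-learn. The topic count, the index of the "relevant" topic, and the r% cut-offs are all illustrative assumptions you would tune after inspecting the learned topics; the toy corpus stands in for your 800k abstracts.

```python
# Weak labelling via LDA: fit topics, pseudo-label the documents with the
# highest/lowest weight on a hand-picked "relevant" topic, then train a
# classifier on those pseudo-labels.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "workplace learning and employee training programs",
    "on the job training improves worker skills",
    "protein folding dynamics in molecular biology",
    "quantum entanglement in photonic systems",
] * 25  # stand-in corpus; replace with your titles + abstracts

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topic = lda.fit_transform(X)            # each row sums to 1

relevant_topic = 0                          # pick after inspecting top words per topic
scores = doc_topic[:, relevant_topic]
order = np.argsort(scores)

r = int(0.10 * len(docs))                   # top r% -> pseudo-positive
pos, neg = order[-r:], order[:r]            # bottom r% -> pseudo-negative

idx = np.concatenate([pos, neg])
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression(max_iter=1000).fit(X[idx], y)
print(clf.predict(X[:2]))
```

In practice you would inspect `lda.components_` (topic-word weights) to decide which topic(s) actually correspond to "learning as it relates to work" before choosing `relevant_topic`.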
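Idea 4 is a small preprocessing step; a minimal sketch, where the `title_` prefix is just an illustrative convention:

```python
# Prefix title tokens so a bag-of-words model sees them as distinct
# features from abstract tokens and can weight them separately.
def merge_title_abstract(title: str, abstract: str) -> str:
    """Return one token stream with title words marked distinctly."""
    title_tokens = ["title_" + w for w in title.lower().split()]
    return " ".join(title_tokens + abstract.lower().split())

text = merge_title_abstract("Workplace Learning", "We study how employees learn.")
print(text)
# -> "title_workplace title_learning we study how employees learn."
```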

Most of these methods are bootstrapping or weakly supervised approaches to text classification, on which you can find more literature.
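Ideas 2 and 6 share the same retrieval step, which can be sketched with TF-IDF and cosine similarity (a lightweight stand-in for Lucene-style ranking). The query text and the toy documents are assumptions; in practice you would rank all 800k articles and cut off where relevance visibly drops.

```python
# Rank the corpus by cosine similarity to a query describing the
# conceptual space; the top of the ranking yields pseudo-positives and
# the bottom yields pseudo-negatives.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "learning at work: a study of employee development",
    "informal workplace learning among nurses",
    "spectral properties of rare earth magnets",
    "a survey of graph colouring algorithms",
]
query = "learning as it relates to work, workplace training and development"

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs + [query])           # fit query in same space
sims = cosine_similarity(X[-1], X[:-1]).ravel()  # query vs each article

ranked = np.argsort(sims)[::-1]                  # most similar first
for i in ranked:
    print(f"{sims[i]:.3f}  {docs[i]}")
```

The two on-topic articles score above zero while the unrelated ones score (near) zero, which is exactly the separation the pseudo-labelling relies on.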

Ash