I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
The data for each article are the title and abstract (mean length = 1,300 characters).
Any approach may be used, or several combined, including supervised machine learning and/or deriving features with threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might include latent semantic analysis, n-grams, etc.
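
To make the n-gram idea concrete, here is a minimal sketch of the kind of baseline I could imagine: word unigrams and bigrams fed into a multinomial Naive Bayes classifier, with a threshold on the resulting log-odds deciding inclusion. Everything in it is an assumption for illustration only: the four toy titles, the function names (`ngrams`, `train_nb`, `log_odds`, `classify`), and the choice of Naive Bayes itself. It uses only the standard library; on the real corpus I would presumably reach for an established library instead.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercased word n-grams for n = 1..n_max."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def train_nb(texts, labels):
    """Collect per-class n-gram counts for multinomial Naive Bayes."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(ngrams(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def log_odds(model, text):
    """Log-odds of relevant (1) versus irrelevant (0), add-one smoothed."""
    counts, docs, vocab = model
    score = math.log(docs[1] / docs[0])  # class prior
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    for gram in ngrams(text):
        if gram in vocab:  # n-grams never seen in training are ignored
            score += math.log((counts[1][gram] + 1) / totals[1])
            score -= math.log((counts[0][gram] + 1) / totals[0])
    return score


def classify(model, text, threshold=0.0):
    """Label as relevant when the log-odds exceed the threshold."""
    return 1 if log_odds(model, text) > threshold else 0


# Hypothetical hand-coded examples, not taken from the real corpus.
texts = [
    "Workplace learning and informal training among nurses",
    "Learning at work: how employees develop skills on the job",
    "Machine learning for protein structure prediction",
    "Deep learning models for image segmentation",
]
labels = [1, 1, 0, 0]

model = train_nb(texts, labels)
print(classify(model, "Informal learning among employees at work"))  # prints 1
print(classify(model, "Deep learning for image prediction"))         # prints 0
```

The toy example also shows why plain frequency counts worry me: "learning" appears in every title, so only the surrounding unigrams and bigrams ("at work", "deep learning") separate the two classes. Raising `threshold` would trade recall for precision.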
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
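
One part of the "enough?" question that I can estimate up front is how precisely a held-out share of the coded articles would let me measure a classifier's accuracy. The sketch below uses the normal approximation for a proportion; the 20% hold-out share and the 90% accuracy are hypothetical placeholders, not measured results, and `margin_of_error` is just a name I made up. It says nothing about whether the remaining articles suffice for training, which I assume would need something like a learning curve.

```python
import math
from statistics import NormalDist


def margin_of_error(accuracy, n_test, confidence=0.95):
    """Half-width of a normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test)


n_coded = 8000          # 1% of the corpus, as stated in the question
test_share = 0.2        # assumed hold-out share
assumed_accuracy = 0.9  # hypothetical value, not a measured result

n_test = round(n_coded * test_share)
print(n_test, round(margin_of_error(assumed_accuracy, n_test), 4))
# prints: 1600 0.0147
```

Under these assumptions, 1,600 held-out articles would pin the measured accuracy down to roughly plus or minus 1.5 percentage points.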
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!