I just read the paper Distributed Representations of Sentences and Documents. In the sentiment analysis experiment section, it says: "After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating." So the paper uses logistic regression as the classifier that decides the label.
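As I understand it, the paper's pipeline is: first learn a fixed vector per sentence/document, then train an ordinary logistic regression on those vectors. Here is a minimal self-contained sketch of that second step (the 2-D "paragraph vectors", the class name, and all parameters are my own toy assumptions, not from the paper or dl4j):

```java
// Minimal binary logistic regression over fixed document vectors,
// trained with batch gradient descent on the log-loss.
// X: one learned paragraph vector per row; y: 0/1 sentiment labels.
class ParagraphVectorLogReg {
    double[] w;
    double b;

    ParagraphVectorLogReg(int dim) { w = new double[dim]; }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    double predictProb(double[] x) {
        double z = b;
        for (int i = 0; i < x.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    void fit(double[][] X, int[] y, double lr, int epochs) {
        for (int e = 0; e < epochs; e++) {
            double[] gw = new double[w.length];
            double gb = 0;
            for (int n = 0; n < X.length; n++) {
                // gradient of log-loss: (p - y) * x
                double err = predictProb(X[n]) - y[n];
                for (int i = 0; i < w.length; i++) gw[i] += err * X[n][i];
                gb += err;
            }
            for (int i = 0; i < w.length; i++) w[i] -= lr * gw[i] / X.length;
            b -= lr * gb / X.length;
        }
    }

    public static void main(String[] args) {
        // Toy 2-D "paragraph vectors": positive reviews cluster high, negative low.
        double[][] X = { {2.0, 1.5}, {1.8, 2.2}, {-1.5, -2.0}, {-2.2, -1.1} };
        int[] y = { 1, 1, 0, 0 };
        ParagraphVectorLogReg model = new ParagraphVectorLogReg(2);
        model.fit(X, y, 0.5, 500);
        System.out.println(model.predictProb(new double[]{2.0, 2.0}) > 0.5);
        System.out.println(model.predictProb(new double[]{-2.0, -2.0}) < 0.5);
    }
}
```

In a real setup the rows of `X` would be the document vectors produced by the trained doc2vec model rather than hand-written toy points.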
Then I moved on to dl4j and read the example "ParagraphVectorsClassifierExample". The code is shown below:
void makeParagraphVectors() throws Exception {
    ClassPathResource resource = new ClassPathResource("paravec/labeled");

    // build an iterator for our dataset
    iterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(resource.getFile())
            .build();

    tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    // ParagraphVectors training configuration
    paragraphVectors = new ParagraphVectors.Builder()
            .learningRate(0.025)
            .minLearningRate(0.001)
            .batchSize(1000)
            .epochs(20)
            .iterate(iterator)
            .trainWordVectors(true)
            .tokenizerFactory(tokenizerFactory)
            .build();

    // Start model training
    paragraphVectors.fit();
}
void checkUnlabeledData() throws IOException {
    /*
     At this point we assume that the model has been built, so we can check
     which categories our unlabeled documents fall into.
     We'll load the unlabeled documents and classify them.
    */
    ClassPathResource unClassifiedResource = new ClassPathResource("paravec/unlabeled");
    FileLabelAwareIterator unClassifiedIterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(unClassifiedResource.getFile())
            .build();

    /*
     Now we'll iterate over the unlabeled data and check which label each document
     could be assigned. Please note: in many domains it's normal for one document
     to fall into several labels at once, with a different "weight" for each.
    */
    MeansBuilder meansBuilder = new MeansBuilder(
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable(),
            tokenizerFactory);
    LabelSeeker seeker = new LabelSeeker(iterator.getLabelsSource().getLabels(),
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable());

    while (unClassifiedIterator.hasNextDocument()) {
        LabelledDocument document = unClassifiedIterator.nextDocument();
        INDArray documentAsCentroid = meansBuilder.documentAsVector(document);
        List<Pair<String, Double>> scores = seeker.getScores(documentAsCentroid);

        /*
         Please note: document.getLabels() is used here just to show which document
         we're looking at, as a substitute for printing the whole document name.
         So the labels on these documents serve as titles,
         just to verify that our classification was done properly.
        */
        log.info("Document '" + document.getLabels() + "' falls into the following categories: ");
        for (Pair<String, Double> score : scores) {
            log.info("    " + score.getFirst() + ": " + score.getSecond());
        }
    }
}
It demonstrates how doc2vec can associate arbitrary documents with labels, but it hides the implementation behind the scenes. My question is: does this example also classify via logistic regression? If not, what does it use? And how could I implement the classification step with logistic regression myself?
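My current guess, purely an assumption on my part and not something I verified in the dl4j source, is that the per-label score is a similarity measure (e.g. cosine similarity) between the document centroid and each label's learned vector, something like:

```java
// My guess at the scoring step: cosine similarity between a document
// centroid and a label vector. Names and vectors here are hypothetical.
class CosineScoreGuess {
    static double cosine(double[] doc, double[] label) {
        double dot = 0, normDoc = 0, normLabel = 0;
        for (int i = 0; i < doc.length; i++) {
            dot += doc[i] * label[i];
            normDoc += doc[i] * doc[i];
            normLabel += label[i] * label[i];
        }
        return dot / (Math.sqrt(normDoc) * Math.sqrt(normLabel));
    }

    public static void main(String[] args) {
        double[] docCentroid = {1.0, 2.0, 3.0};
        double[] positiveLabel = {1.0, 2.0, 3.0};  // same direction -> score 1.0
        double[] negativeLabel = {-1.0, -2.0, -3.0};  // opposite -> score -1.0
        System.out.println(cosine(docCentroid, positiveLabel));
        System.out.println(cosine(docCentroid, negativeLabel));
    }
}
```

If that guess is right, it would be a nearest-label scheme rather than a trained logistic regression, which is why I'm asking how to plug a logistic regression in instead.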