I just read the paper Distributed Representations of Sentences and Documents. In the sentiment analysis experiment section, it says, "After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating." So it uses a logistic regression as the classifier to determine what the label is.

Then I moved on to DL4J and read the example "ParagraphVectorsClassifierExample"; its code is shown below:

       void makeParagraphVectors()  throws Exception {
         ClassPathResource resource = new ClassPathResource("paravec/labeled");

         // build an iterator for our dataset
         iterator = new FileLabelAwareIterator.Builder()
                 .addSourceFolder(resource.getFile())
                 .build();

         tokenizerFactory = new DefaultTokenizerFactory();
         tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

         // ParagraphVectors training configuration
         paragraphVectors = new ParagraphVectors.Builder()
                 .learningRate(0.025)
                 .minLearningRate(0.001)
                 .batchSize(1000)
                 .epochs(20)
                 .iterate(iterator)
                 .trainWordVectors(true)
                 .tokenizerFactory(tokenizerFactory)
                 .build();

         // Start model training
         paragraphVectors.fit();
       }

       void checkUnlabeledData() throws IOException {
         /*
         At this point we assume that we have model built and we can check
         which categories our unlabeled document falls into.
         So we'll start loading our unlabeled documents and checking them
        */
        ClassPathResource unClassifiedResource = new ClassPathResource("paravec/unlabeled");
        FileLabelAwareIterator unClassifiedIterator = new FileLabelAwareIterator.Builder()
                .addSourceFolder(unClassifiedResource.getFile())
                .build();

        /*
         Now we'll iterate over unlabeled data, and check which label it could be assigned to
         Please note: for many domains it's normal to have 1 document fall into a few labels at once,
         with different "weight" for each.
        */
        MeansBuilder meansBuilder = new MeansBuilder(
            (InMemoryLookupTable<VocabWord>)paragraphVectors.getLookupTable(),
              tokenizerFactory);
        LabelSeeker seeker = new LabelSeeker(iterator.getLabelsSource().getLabels(),
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable());

        while (unClassifiedIterator.hasNextDocument()) {
            LabelledDocument document = unClassifiedIterator.nextDocument();
            INDArray documentAsCentroid = meansBuilder.documentAsVector(document);
            List<Pair<String, Double>> scores = seeker.getScores(documentAsCentroid);

            /*
             please note, document.getLabels() is used just to show which document we're looking at now,
             as a substitute for printing out the whole document name.
             So, labels on these two documents are used like titles,
             just to show that the classification was done properly
            */
            log.info("Document '" + document.getLabels() + "' falls into the following categories: ");
            for (Pair<String, Double> score: scores) {
                log.info("        " + score.getFirst() + ": " + score.getSecond());
            }
        }

       }

It demonstrates how doc2vec associates arbitrary documents with labels, but it hides the implementation behind the scenes. My question is: does it also do so by logistic regression? If not, what does it use? And how can I do it with logistic regression?

kevin zhao

1 Answer

I'm not familiar with DL4J's approach, but at the core 'Paragraph Vector'/'Doc2Vec' level, documents typically have an identifier assigned by the user – most often, a single unique ID. Sometimes, though, these (provided) IDs have been called "labels", and further, it can sometimes be useful to re-use known labels as if they were per-document doc-tokens, which can lead to confusion. In the Python gensim library, we call those user-provided tokens "tags" to distinguish them from "labels" that might come from a totally different, downstream vocabulary.
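
For example, in gensim each training document is wrapped in a TaggedDocument whose tags are just the user-assigned IDs; any sentiment labels live in a separate, downstream structure. A minimal sketch (the corpus and names here are made up):

    from gensim.models.doc2vec import TaggedDocument

    # Hypothetical toy corpus: each document carries a unique "tag" (its ID).
    corpus = [
        TaggedDocument(words=["great", "movie", "loved", "it"], tags=["DOC_0"]),
        TaggedDocument(words=["terrible", "plot", "bad", "acting"], tags=["DOC_1"]),
    ]

    # Downstream labels are kept separately and never shown to Doc2Vec itself.
    downstream_labels = {"DOC_0": "positive", "DOC_1": "negative"}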

So in a follow-up paper like "Document Embedding with Paragraph Vectors", each document has a unique ID - its title or identifier within Wikipedia or arXiv. But then the resulting doc-vectors are evaluated by how well they place documents with the same category-labels closer to each other than to third documents. So there's both a learned doc-tag space, and a downstream evaluation based on other labels (that weren't in any way provided to the unsupervised Paragraph Vector algorithm).

Similarly, you might give all training documents unique IDs, but then later train a separate classifier (of any algorithm) that uses the doc-vectors as inputs and learns to predict other labels. That's my understanding of the IMDB experiment in the original 'Paragraph Vectors' paper: every review has a unique ID during training, and thus gets its own doc-vector. But then a downstream classifier was trained to predict positive/negative review sentiment from those doc-vectors. So, the assessment/prediction of labels ("positive"/"negative") was a separate downstream step.
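
If you want to reproduce that pattern yourself with logistic regression, one way is to train a Doc2Vec model over uniquely-tagged reviews and then fit a scikit-learn LogisticRegression on the resulting doc-vectors. A sketch, assuming you've already tokenized your data into train_texts/train_labels/test_texts (those names are made up):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    # Unsupervised step: every review gets a unique ID and thus its own doc-vector.
    train_docs = [TaggedDocument(words=tokens, tags=[str(i)])
                  for i, tokens in enumerate(train_texts)]
    model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
    model.build_vocab(train_docs)
    model.train(train_docs, total_examples=model.corpus_count, epochs=model.epochs)

    # Downstream step: a logistic regression learns the labels from the doc-vectors.
    X_train = [model.dv[str(i)] for i in range(len(train_texts))]  # model.docvecs in gensim < 4
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    # Unseen documents are folded in by inference, then classified.
    X_test = [model.infer_vector(tokens) for tokens in test_texts]
    predictions = clf.predict(X_test)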

As mentioned, it's sometimes the case that re-using known category-labels as doc-ids – either as the only doc-ID, or as an extra ID in addition to a unique-per-document ID – can be useful. In a way, it creates synthetic combined documents for training, made up of all documents with the same label. This may tend to influence the final space/coordinates to be more discriminative with regard to the known labels, and thus make the resulting doc-vectors more helpful to downstream classifiers. But then you've replaced classic 'Paragraph Vector', with one ID per doc, with a similar semi-supervised approach where known labels influence training.
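
A sketch of that semi-supervised variant, again in gensim with hypothetical names: each training document carries its known label as an extra tag alongside its unique ID, so the label token also gets a trained vector that every same-label document influences:

    from gensim.models.doc2vec import TaggedDocument

    # Each document keeps its unique ID *and* re-uses its known label as a second tag.
    semi_supervised_docs = [
        TaggedDocument(words=tokens, tags=[str(i), label])
        for i, (tokens, label) in enumerate(zip(train_texts, train_labels))
    ]

    # Training proceeds as before; afterwards an unseen document can be scored by
    # comparing its inferred vector against the trained label vectors, e.g.:
    #   model.dv.most_similar([model.infer_vector(new_tokens)], topn=2)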

gojomo