
Do you have any suggestions for how I could subdivide documents into sentences before training MALLET LDA?

Thank you in advance


3 Answers


Depending on your definition of a sentence, this can be done in Java using String.split("\\.\\s"), assuming that a user ends a sentence with a period and begins a new one after whitespace. The period is escaped since the parameter of split is a regex. The \\s means "any whitespace character", which also takes care of line endings and tabs.

String test = "Hello. World. Cat eats dog.";
String[] splitString = test.split("\\.\\s");

The content of splitString is now {"Hello", "World", "Cat eats dog."}; note that the last period was not removed, since it is not followed by whitespace. You can now write the sentences to files using a BufferedWriter:

// needs: import java.io.BufferedWriter; import java.io.File;
// import java.io.FileWriter; import java.io.IOException;
try {
    String filename = "test";
    int i = 0;
    for (String sentence : splitString) {
        File file = new File(filename + i + ".txt");
        // createNewFile() returns false if the file already exists,
        // so you can check its return value to avoid overwriting
        file.createNewFile();
        // try-with-resources flushes and closes the writer for each file
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
            writer.append(sentence + "\n");
        }
        i++;
    }
} catch (IOException ioexception) {
    System.out.println(ioexception.getMessage());
    System.exit(1);
}

This writes the split sentences out, each to a different file. Watch out, though: this can lead to space problems on FAT32-formatted drives (a common default for removable media), which allocate a full cluster (up to 32 kB) for every file no matter how small it is (a file holding 8 kB of text can take up 32 kB on the drive). It might be a little impractical, but it works. Now you just run import-dir on the directory all of these files are in and use the resulting file in LDA. You can also read the relevant part of the provided tutorial here:

https://programminghistorian.org/lessons/topic-modeling-and-mallet#getting-your-own-texts-into-mallet

For larger inputs (around 5,000 sentences and up, which results in at least 160 MB of data on such a file system) I would suggest you still do the splitting, but instead of writing many small files, write everything to one file and import the data yourself using the MALLET API. See http://mallet.cs.umass.edu/import-devel.php for a developer's guide and http://mallet.cs.umass.edu/api/ for the API documentation.
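
To give a rough idea of that route, here is a minimal sketch, not part of the linked guide: it assumes a file sentences.txt with one sentence per line, mallet.jar on the classpath, and an arbitrary topic count of 10.

import cc.mallet.pipe.*;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class SentenceLda {
    public static void main(String[] args) throws Exception {
        // lowercase, tokenize, and map tokens to feature indices
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());
        InstanceList instances = new InstanceList(new SerialPipes(pipes));

        // one sentence per line -> one MALLET instance per sentence
        try (BufferedReader reader = new BufferedReader(new FileReader("sentences.txt"))) {
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null) {
                instances.addThruPipe(new Instance(line, null, "sentence" + i++, null));
            }
        }

        ParallelTopicModel lda = new ParallelTopicModel(10); // topic count is just an example
        lda.addInstances(instances);
        lda.estimate();
    }
}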

MrDeal

These functions will prepare your documents to be passed into your LDA. I'd also look into setting up a bow_corpus, since LDA takes numbers, not sentences. It works like this: the word "going" is stemmed to "go", then numbered/indexed to, say, 2343, and counted by frequency; maybe it pops up twice, so the bow_corpus entry would be (2343, 2), which is what an LDA expects. (There is a short sketch of that step after the code below.)

# Gensim: unsupervised topic modeling, natural language processing, statistical machine learning
import gensim
# convert a document to a list of tokens
from gensim.utils import simple_preprocess
# remove stopwords - words that are not telling: "it", "I", "the", "and", etc.
from gensim.parsing.preprocessing import STOPWORDS
# corpus iterator
from gensim import corpora, models

# nltk - Natural Language Toolkit
# lemmatized: words in third person are changed to first person, and verbs in past
# and future tenses are changed into present.
# stemmed: words are reduced to their root form.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer

# Create functions to lemmatize, stem, and preprocess

# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words that are 3 letters or shorter
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic,
# then append the tokens to a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


# send the comments column through the preprocessing step
# (map applies the function to each row)

processed_docs = documents['Your Comments title header'].map(preprocess)
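
To get from the token lists in processed_docs to the bag-of-words corpus described above, a minimal sketch with gensim might look like this (the number of topics and passes are placeholders you would tune):

# build the token <-> id mapping from the preprocessed documents
dictionary = gensim.corpora.Dictionary(processed_docs)

# bag-of-words: each document becomes a list of (token_id, frequency) pairs,
# e.g. (2343, 2) if the token with id 2343 occurs twice in the document
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# train the LDA model on the numeric corpus
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10,
                                   id2word=dictionary, passes=2)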
Sara

You can, for instance, use the OpenNLP Sentence Detection Tools. They have been around for a while now and perform decently in most cases.

The documentation is here and the models can be downloaded here. Note that the version 1.5 models are fully compatible with the newer opennlp-tools version 1.8.4.

If you are using Maven, just add the following to your pom.

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>
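
Sentence splitting is then only a few lines; a minimal sketch, assuming you have downloaded the pre-trained English sentence model and saved it as en-sent.bin:

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

import java.io.FileInputStream;
import java.io.InputStream;

public class OpenNlpSplit {
    public static void main(String[] args) throws Exception {
        // load the pre-trained sentence model (downloaded separately)
        try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // returns one String per detected sentence
            String[] sentences = detector.sentDetect("Hello. World. Cat eats dog.");
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}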

If you plan to switch the model input from documents to sentences, be aware that vanilla LDA (which, afaik, also holds for the current implementation in MALLET) may not produce satisfactory results, since word co-occurrence counts are not very telling within single sentences.

I would suggest investigating whether the paragraph level is more interesting. Paragraphs in documents can be extracted with line-break patterns; for instance, a new paragraph starts when you have two consecutive line breaks.
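
In Java, that heuristic can be a one-liner; a minimal sketch (documentText is a placeholder for your loaded document, and \n line endings are assumed):

// split into paragraphs at blank lines: a line break, optional
// whitespace (including further breaks), then another line break
String[] paragraphs = documentText.split("\\n\\s*\\n");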

aplz