Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
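
The "pipes" idea described above can be illustrated with a minimal sketch in plain Java. It does not use MALLET's own pipe classes; the class name `PipeSketch` and its methods are hypothetical and only mirror the three steps named in the description (tokenizing, removing stopwords, converting to counts), chained in order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only: plain Java, not MALLET's API.
public class PipeSketch {

    // Step 1: split raw text into lowercase tokens (runs of letters).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: drop every token that appears in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Step 3: turn the token sequence into a term -> count map.
    static Map<String, Integer> toCounts(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Chain the three steps so each one consumes the previous step's output.
    static Map<String, Integer> process(String text, Set<String> stopwords) {
        Function<String, List<String>> tokenizeStep = PipeSketch::tokenize;
        return tokenizeStep
                .andThen(tokens -> removeStopwords(tokens, stopwords))
                .andThen(PipeSketch::toCounts)
                .apply(text);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                process("The cat saw the other cat.", Set.of("the"));
        System.out.println(counts); // prints {cat=2, saw=1, other=1}
    }
}
```

Keeping each step as a separate function means a step can be swapped or reordered without touching the others, which is the flexibility the description attributes to pipes.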

321 questions
2 votes
1 answer

Mallet CRF Sequence Classification Training Data Format

I am trying to train a CRF sequence model using the Mallet library, but I am missing some important information. I found an example in the library itself at https://github.com/mimno/Mallet/blob/master/src/cc/mallet/examples/TrainCRF.java however…
user1893354
2 votes
2 answers

Mallet: OutOfMemoryError: Java heap space

While training on data in Mallet, the process stopped because of an OutOfMemoryError. The MEMORY attribute in bin/mallet has already been set to 3 GB. The training file, output.mallet, is only 31 MB. I have tried to reduce the training data size. But…
Long Le Minh
2 votes
1 answer

Extracting keywords from relevant topics using a trained MALLET Topic model

I'm attempting to use MALLET's TopicInferencer to infer keywords from arbitrary text using a trained model. So far my overall approach is as follows. Train a ParallelTopicModel with a large set of known training data to create a collection of…
francis
2 votes
1 answer

How do I use previous token's label as feature in my CRF?

I am looking for a way to use features conditioned on attributes and label bigrams in Mallet. I am still trying to understand how one would be able to use the label of a token just generated as a feature for determining the label of the next…
afs
2 votes
1 answer

Strange perplexity values of LDA model trained with MALLET

I have trained an LDA model with MALLET on parts of the Stack Overflow data dump and did a 70/30 split for training and test data. But the perplexity values are strange, because they are lower for the test set than for the training set. How is this…
phly
2 votes
3 answers

Question about Latent Dirichlet Allocation (MALLET)

Honestly, I'm not familiar with LDA, but am required to use MALLET's topic modeling for one of my projects. My question is: given a set of documents within a specific timestamp as the training data for the topic model, how appropriate is it to use…
goh
2 votes
0 answers

How to train a sequence CRF model with Mallet

I am a new Mallet user and have started with the latest stable version, 2.0.8. My task is to code a sequence tagger. This is the code: ArrayList pipes = new ArrayList<>(); pipes.add(new SaveDataInSource()); pipes.add(new…
Dail
2 votes
1 answer

Hierarchical LDA eats up all available memory and never finishes

I am waiting for my membership on the mailing list to be confirmed, so I thought I would ask here to maybe speed things up a little. I am writing my master's thesis on topic modeling and use the Mallet implementations of LDA and HLDA. I work…
wojtuch
2 votes
1 answer

Get phi, theta, doc.length, vocab, term.frequency from mallet LDA object?

I am trying to use a mallet topic model with the LDAvis package. To do so you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency. The mallet documentation makes no mention of these…
histelheim
2 votes
2 answers

Mallet basic usage. First steps

I'm trying to use Mallet with literally no experience in topic modeling and the like. My goal is to get N topics from the M documents that I have right now, and to classify every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic…
Kirill
2 votes
1 answer

How is the weight of a word in a topic calculated in Mallet?

I'm trying to figure out what the weight assigned to each word in a topic represents in Mallet. I'm assuming it's some form of document occurrence count. However, I'm having a hard time figuring out how that figure is arrived at. In my model, there…
Jeen Broekstra
2 votes
1 answer

Mallet POS-Tagging learning time

I've been trying to use the Mallet Simple Tagger (http://mallet.cs.umass.edu/sequences.php) to learn a CRF model for POS tagging. I am now starting to get worried and confused, as my computer has been training this one model for over a week. It does…
Kai
2 votes
0 answers

How to load MALLET LDA model state from R?

I have been using the mallet package in R that wraps MALLET. So far so good, but I was wondering how I can save the model state to disk and load it again, so that I don't have to train all over and get a different model each time. For the saving part it…
Yannis P.
2 votes
1 answer

rJava gives a NullPointerException in .jcall

I am trying to run a standard corpus-loading method in the mallet R package, more specifically: instance <- mallet.import(names(txt$CELEX), txt$TEXT, stoplist.file = "stopwords.en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}") Then I get…
Yannis P.
2 votes
0 answers

How to TF-IDF transform an InstanceList of FeatureVectors - MALLET

I have a MALLET InstanceList where the data fields of the Instance objects are MALLET FeatureVectors. I want to TF-IDF transform them with the same effect as…
Mark Collier