Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
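The "pipes" idea described above can be sketched in a few lines. This is plain Python for illustration only, not MALLET's Java API: the function names, the `run_pipes` helper, and the tiny stopword list are all hypothetical. Each step takes the previous step's output, mirroring the tokenize, remove-stopwords, count-vector sequence.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; MALLET ships its own.
STOPWORDS = {"the", "a", "and", "of", "into", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def to_count_vector(tokens):
    """Map each remaining token to the number of times it occurs."""
    return Counter(tokens)


def run_pipes(data, pipes):
    """Feed the data through each pipe in order, like a serial pipeline."""
    for pipe in pipes:
        data = pipe(data)
    return data


PIPES = [tokenize, remove_stopwords, to_count_vector]
counts = run_pipes("The cat chased the dog and the dog chased a ball", PIPES)
print(counts)
```

Keeping each step as a separate function means a step can be swapped or reordered without touching the others, which is the flexibility the description attributes to pipes.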

321 questions
1 vote, 2 answers

How to fix mallet on gensim

I wrote an LDA model in a notebook. I'm trying to wrap my gensim LDA model with Mallet and am getting the following error: CalledProcessError: Command '../input/mymallet/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords…
1 vote, 1 answer

Why can I not choose a beta parameter when conducting LDA with Mallet?

I have recently been working with Mallet to conduct LDA topic modeling. I noticed that I am able to pass the alpha hyperparameter for the algorithm to Mallet, but the LDAMallet class does not contain any variable for the beta parameter. Can you guys…
user13567633
1 vote, 0 answers

LDA model: why are topic "words" numbers?

I have a set of trigrams (see pickle file). The column name is the trigram; each cell represents a document; the cell entries denote the occurrence (binary). I then preprocess the trigrams and train an LDA model using the code below. However,…
user456789
1 vote, 1 answer

Mallet DMR negative probability for feature-based topic distribution?

I've created a DMR topic model (via the Java API) that calculates the topic distribution based on the publication year of the documents. The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all…
HaPlasma
1 vote, 1 answer

Use Log Likelihood to compare different mallet topic models?

I'm trying to find out whether it's possible, and what the best way is, to programmatically compare different topic models created with Mallet to determine the "best" fitting model for a given corpus. The API offers a method to determine the Log…
HaPlasma
1 vote, 1 answer

Mallet outputting either topic weight 0.0 or 1.0 and nothing in between

So I created a little program using Mallet's API, following this example in the developer's guide. However, I do not understand the final weight output. While the program is running, it outputs reasonable weights for each topic (see below): Mallet…
1 vote, 3 answers

IndexError: list index out of range in Python Script

I'm new to Python, so I apologize if this question has already been answered. I've used this script before and it worked, so I'm not at all sure what is wrong. I'm trying to transform a MALLET output document into a long list of topic, weight,…
1 vote, 0 answers

Why does Mallet LDA give poor results when the Gensim version doesn't?

I'm working my way through LDA models for text analysis; I've heard that the Mallet implementation is the best. However, it seems to generate very poor results when I compare it with the Gensim version, so I think I may be doing something wrong. Can…
Lodore66
1 vote, 1 answer

Mallet NaiveBayes Classifier in Java null pointer

I am trying to instantiate a naive Bayes classifier to classify text blocks (with a pre-defined classification). The example below just tries to do it with male/female. I have tried loading data from file (CSVloader) and by creating instances below.…
1 vote, 0 answers

Unable to perform Topic Modelling in Databricks with gensim mallet

I am trying to perform topic modelling on Databricks using the Gensim wrapper for Mallet. I already have running code for the same on my local system. Here is some sample code that already works on my local system: import…
1 vote, 2 answers

CalledProcessError: Returned non-zero exit status 1

When I try to run: def remove_stopwords(texts): return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts] def make_bigrams(texts): return [bigram_mod1[doc] for doc in texts] # Remove Stop…
Emil
1 vote, 1 answer

How to automatically generate one or two words to represent a topic?

Mallet generates topics with top keywords. The keywords are unique for one topic. Is there an automatic way to select a certain word or several words from the topic keywords as the topic label? For example, 20 topics are generated from 500…
Dylan
1 vote, 1 answer

Coherence graph blank - Coherence Value of nan

Thanks for stopping by. I was trying to get some help with this graph that is showing up blank. I'm following this tutorial #17 https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ to build a graph of coherence scores for…
Sara
1 vote, 1 answer

How to predict test data on Gensim Topic modelling

I have used Gensim LDAMallet for topic modelling, but how can we predict the topics of a sample paragraph using the pretrained model? # Build the bigram and trigram models bigram = gensim.models.Phrases(t_preprocess(dataset.data),…
1 vote, 1 answer

Python Gensim LDAMallet CalledProcessError with large corpus (runs fine with small corpus)

I'm getting a CalledProcessError ("non-zero exit status 1") when I run the Gensim LDAMallet model on my full corpus of ~16 million documents. Interestingly enough, if I run the exact same code on a testing corpus of ~160,000 documents the code…
ctim