Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
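
To make the "pipes" idea concrete, here is a minimal conceptual sketch in plain Python. It is not MALLET's Java API: every name in it (`tokenize`, `remove_stopwords`, `to_counts`, `serial_pipes`) is hypothetical and chosen only for illustration. It assumes each pipe is a function that takes the previous stage's output and returns its own, so that tokenizing, stopword removal, and counting can be chained.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Set

# All names below are hypothetical; this is NOT MALLET's API, only a sketch
# of chaining small text-processing steps in the spirit of "pipes".


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(stopwords: Set[str]) -> Callable[[List[str]], List[str]]:
    """Build a pipe that drops every token found in `stopwords`."""
    def pipe(tokens: List[str]) -> List[str]:
        return [t for t in tokens if t not in stopwords]
    return pipe


def to_counts(tokens: List[str]) -> Counter:
    """Convert a token sequence into a bag-of-words count vector."""
    return Counter(tokens)


def serial_pipes(pipes: Iterable[Callable]) -> Callable:
    """Compose pipes so each one receives the previous one's output."""
    pipe_list = list(pipes)

    def run(data):
        for pipe in pipe_list:
            data = pipe(data)
        return data
    return run


pipeline = serial_pipes([
    tokenize,
    remove_stopwords({"the", "a", "of"}),
    to_counts,
])

counts = pipeline("The cat chased the tail of a cat")
print(counts)
```

Each stage handles one task, so a step (for example, a different stopword list) can be swapped without touching the others.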

321 questions
1
vote
1 answer

What is the correct svmlight input format in Mallet?

I am using Mallet with the SVMLight input format to do classification using the NaiveBayes classifier, but I get a NumberFormatException. I'm wondering how I can use string features when using SVMLight. As I read in the guideline, the features can…
user1419243
  • 1,655
  • 3
  • 19
  • 33
1
vote
0 answers

Linear Chain CRF with Feature Function Filtering

I am working on collective classification of entities and using the CRFClassifier class for sequence labelling. I have a requirement that a certain feature F_i should NOT be considered with certain class label C_i. I have specified various flags in…
sapan shah
  • 11
  • 2
1
vote
1 answer

Chinese characters garbled when importing into MALLET

I am trying to use MALLET for topic modeling of a Chinese text. As the first step I used Stanford Word Segmenter to get something looking like this: > 关于 处理 五反运动 遗留 问题 的 指示 转发 华东局 批转 浙江 省委 批转 省委 办公厅 关于 粮食 统销 工作 与 处理 > 意见 的 报告 和 对 打击 富农 奸商 投机…
lepuck
  • 21
  • 2
1
vote
1 answer

mallet error IllegalArgumentException: Couldn't read InstanceList from file complaints.mallet

I am trying to use Mallet for a research project and I keep getting the same error. Here are the instructions I have been using: Once you have all of the files in the Complaints folder Step 1: Clean the files using CAT Scanner Open the program from…
1
vote
1 answer

Lda using mallet

I ran the file SimpleLDA.java and got: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0 at cc.mallet.topics.SimpleLDA.main(SimpleLDA.java:560)
sanjana
  • 11
  • 2
1
vote
1 answer

Extending LDA model using Mallet

I am trying to extend the LDA model by adding another layer of locations. Is it possible to add another layer in Mallet? If so, which classes should I extend? The process I'm trying to model: 1. Choose a region 2. Choose a topic 3. Choose a word
Elad Kravi
  • 43
  • 5
1
vote
1 answer

MALLET - Which weighting schema?

I am using MALLET for text classification (with Naive Bayes) and I understand there is this FeatureSequence2FeatureVector() method for creating feature vectors that can be used as part of the Pipe. My question is which weighting schema is…
B.O.
  • 13
  • 2
1
vote
1 answer

unrecognized option: --diagnostics-file in Mallet

I am using LDA in mallet to explore my data. I do not have training and test data; I just use it for clustering my data. I would like to use a number of useful diagnostic measures provided by Mallet, but when I use this command: bin\mallet…
GeoBeez
  • 920
  • 2
  • 12
  • 20
1
vote
2 answers

Reading documents with r-tm to use with r-mallet

I have this code to fit a topic model with the R wrapper for MALLET: docs <- mallet.import(DF$document, DF$text, stop_words) mallet_model <- MalletLDA(num.topics = 4) mallet_model$loadDocuments(docs) mallet_model$train(100) I have used the tm…
Simon Lindgren
  • 2,011
  • 12
  • 32
  • 46
1
vote
1 answer

Input data to mallet in parallel

I am trying to build a text classifier using mallet. The data is fairly big, so I am looking for a way, if possible, to run the "import" task on multiple threads because it is taking a long time to load. A few questions here: Is there a way to…
1
vote
2 answers

Adding MALLET to the Bash path

I have run into a problem with adding the MALLET topic modelling tool to my path. If I cd to /mallet-2.0-8/ and type ./bin/mallet, all works fine. If I type echo $PATH, I have successfully added '/mallet-2.0.8/bin' to the path. But typing mallet now…
Simon Lindgren
  • 2,011
  • 12
  • 32
  • 46
1
vote
1 answer

When using Mallet, how do I get a list of topics associated with each document

When using Mallet, how do I get a list of topics associated with each document? I think I need to use train-topics and --output-topic-docs, but when I do, I get an error. I'm using Mallet (2.0.8), and I use the following bash script to do my…
ericleasemorgan
  • 213
  • 1
  • 11
1
vote
1 answer

cmd for hLDA topic modeling in mallet

I am trying to use hLDA for topic modeling in mallet. I have already checked this. I am using the command bin\mallet train-topics --input tutorial.mallet, following this tutorial. By default, LDA topic modeling is performed. How can I change it into…
sibora
  • 11
  • 3
1
vote
0 answers

Mallet, probabilities of labels

I'm trying to use the Mallet Simple Tagger (http://mallet.cs.umass.edu/sequences.php) for detecting specific kinds of words in texts (specifically, prominent words). I'm running it with the following standard commands: for training java -cp…
1
vote
2 answers

how to get probability of words of topics in Mallet

I am using LDA in mallet to explore my data. I have no problem running it; I just need the probabilities of the top words (say, 20 words). I use this command: bin\mallet train-topics --input tutorial.mallet --num-topics 40…
GeoBeez
  • 920
  • 2
  • 12
  • 20