Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
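
As a rough illustration of the pipe idea described above, here is a minimal Java sketch that uses only the standard library, not MALLET's own classes. The class name PipeSketch and the methods tokenize, removeStopwords, toCounts, and process are hypothetical names chosen for this example. MALLET's actual pipes may tokenize and filter differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a chain of "pipes"; this is NOT MALLET's API.
public class PipeSketch {

    // Stage 1: split a raw string into lowercase tokens made of letters only.
    // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2: drop every token that appears in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stage 3: turn the token sequence into a token -> count mapping,
    // preserving the order in which tokens were first seen.
    static Map<String, Integer> toCounts(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // The stages run one after another, each consuming the previous output.
    static Map<String, Integer> process(String text, Set<String> stopwords) {
        return toCounts(removeStopwords(tokenize(text), stopwords));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(process("The cat chased the tail of a cat", stopwords));
    }
}
```

Running main on the sample sentence prints {cat=2, chased=1, tail=1}. Each stage takes the previous stage's output as its input, which is the property that makes a chain of pipes easy to rearrange or extend.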

321 questions
3 votes, 0 answers

How to incorporate my own feature functions into Mallet CRF?

I am implementing my own CRF model, and I want to use Mallet's CRF trainer to get weights for the feature functions I implemented. How can I pass my feature functions to Mallet's CRF, so that it would search for their optimal weights?
3 votes, 2 answers

Mallet: Alphabets don't match exceptions

I am trying to implement a document classifier with Mallet in Java. I already have a file that essentially contains feature values, so I don't want to run through an entire raw text processing pipeline. A line in my feature file looks like this at the…
toobee
3 votes, 2 answers

How much time for a topic modeling via MALLET on 9GB corpus

I would like to do LDA topic modeling on a 9GB corpus. The plan is to train an LDA model using MALLET for 1000 iterations with 100 topics, optimizing hyperparameters every 10 iterations after a 200-iteration burn-in period. I am working on 64-bit Win8,…
GileBrt
3 votes, 1 answer

Mallet: java.lang.OutOfMemoryError with 1024GB Memory allocation

I am trying to use Mallet to run topic modeling on a ~1GB text file, with 11403956 rows. From the mallet directory, I cd to bin and upgrade the memory requirement to 1024GB: set MALLET_MEMORY=1024G I then try to run the command: bin/mallet…
duhaime
3 votes, 0 answers

Difference in features between CRF++ and SimpleTagger in Mallet

I'm doing some experiments to compare time performance between CRF++ and SimpleTagger in Mallet. However, after running them, I see there is a gap in accuracy between them, although I set the same parameter (L2-norm). I try to figure it out by…
kidstar
3 votes, 1 answer

Mallet - Topic Modeling - Stopwords Error

Although I add an extra stopword list as well as the default stopword list when I use MALLET for topic modeling, some stop words still appear in the topic models. For example "ın", "ıf", "ıt". How do I ensure that these stopwords don't appear in the topic models? Topic models…
bubunny
3 votes, 1 answer

MALLET topic modeling OutOfMemoryError

I use MALLET for topic modeling. http://mallet.cs.umass.edu/topics.php First, I try to import the training document set following the instructions. bin/mallet import-dir --input /data/topic-input --output topic-input.mallet --keep-sequence…
Benben
3 votes, 1 answer

MALLET Topic Modeling: input String

I have this code to import a .mallet file: File f=new File("/home/test/file.mallet"); InstanceList t=InstanceList.load(f); but if I wanted to manually switch every single instance, how could I do it? I tried this: String str="Test for…
Enzo
3 votes, 2 answers

Mallet in R regex error :java.lang.NoSuchMethodException: No suitable method for the given parameters

I've been following the tutorial on how to use mallet in R to create topic models. My text file has 1 sentence per line. It looks like this and has about 50 sentences. Thank you again and have a good day :). This is an apple. This is awesome! LOL! i…
jxn
3 votes, 0 answers

mallet "import file" use pipe

Currently I'm using mallet, and when it comes to importing data, I'm fine with import file or import directory according to the APIs or explanations online, but when it comes to the infer-topics part, it's said that the new document should be…
JudyJiang
3 votes, 1 answer

Training the classifier in mallet

I have a CSV file with the following format: productname, review of the product. Now, using Mallet, I have to train the classifier so that if a test dataset containing product reviews is given as input, it should tell me to which product a…
2 votes, 1 answer

Mallet: Topical N-grams

I want to run mallet using the --use-ngrams true option but can't seem to get it working. I've imported my data using: ./bin/mallet import-dir --input path --output topic-input.mallet --keep-seqence -- removed stopwords Now I want to train a…
akobre01
2 votes, 1 answer

How to import excel file in mallet

I have an Excel file that contains the titles of Stack Overflow posts. My Excel sheet has more than 10,000 lines, so it is not practical to make a separate .txt file for each row. If I copy my Excel data into a .txt file, is it required to have labels or…
2 votes, 1 answer

Stemming and lemmatizing - What approach?

I am preparing to do topic modeling via Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to take some steps to clean and streamline the texts, of course. I have my lists of stopwords ready and I know…
Glorifier
2 votes, 2 answers

How to use Mallet for NER

I'm new to the subject of NLP and have been asked to perform named entity recognition (NER) using Mallet. I have a text, and I provide a feature vector for each word in it. I would like to train a model that I can later test on a fresh text file. My…
Omer