Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
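The "pipes" idea described above can be illustrated with a small plain-Java sketch. This does not use MALLET's own classes: `PipeSketch`, `tokenize`, `removeStopwords`, and `countVector` are hypothetical names invented here, and the tokenization rule (lowercase, split on non-letters) is an assumption chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration of a pipe chain; none of these names are MALLET classes.
public class PipeSketch {

    // Stage 1: lowercase the text and split it on runs of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2: drop every token that appears in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stage 3: turn the token sequence into a sorted word -> count map.
    static Map<String, Integer> countVector(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Chain the three stages so that each one consumes the previous stage's output.
    static Map<String, Integer> run(String text, Set<String> stopwords) {
        Function<String, List<String>> tokenizer = PipeSketch::tokenize;
        return tokenizer
                .andThen(tokens -> removeStopwords(tokens, stopwords))
                .andThen(PipeSketch::countVector)
                .apply(text);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "on"));
        // prints {cat=2, mat=1, sat=1, slept=1}
        System.out.println(run("The cat sat on the mat, the cat slept.", stopwords));
    }
}
```

Each stage does one job and hands its result to the next, which is the property the description attributes to pipes; MALLET's actual pipe classes and their signatures should be taken from its documentation.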

321 questions
0 votes, 1 answer

Using a random seed in Rmallet

Is there an option (or a workaround) in Rmallet to use a random seed, as is possible through the mallet command line (i.e. --random-seed 1)?
shackett
0 votes, 1 answer

Mallet, how to use ExpGain and GradientGain method to construct a FeatureSelector

I want to test the accuracy of a text classifier built with Mallet. There are 4 feature selection methods available (FeatureCounts, InfoGain, ExpGain and GradientGain). I want to know how to use ExpGain and GradientGain. E.g.: FeatureSelector…
kagtag
0 votes, 1 answer

Trouble understanding the data field in MALLET instance object

Currently I'm working on a project and am using a CsvIterator from the MALLET API to create an InstanceList. However, I'm not quite sure how the data field in a MALLET Instance object is supposed to be formatted. I'm attempting to write the data…
0 votes, 1 answer

Mallet - SimpleTagger Main class not found

I downloaded and installed the latest version of Mallet. I built it successfully, but when I try to run SimpleTagger: java -cp mallet-deps.jar cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample or java -cp…
Ana Maïs
0 votes, 1 answer

CRF Mallet model file

What is model-file when we train a CRF with Mallet? java -cp "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
ahmed123
0 votes, 1 answer

topic proportions in my corpus?

Thanks for reading and taking the time to think about and respond to this. I am using Gensim's wrapper for Mallet (ldamallet.py), and it works like a charm. I need to get the topic proportions for my corpus (over all my documents) and I do not know…
JRun
0 votes, 0 answers

how to get a probability distribution for a topic in mallet?

Using Mallet I can get a specific number of topics and their words. How can I make sure topic words make a probability distribution (i.e. sum to one)? For example if I run it as below, how can I use the outputs given by Mallet to make sure…
samsamara
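As a rough illustration of what "sum to one" means for the question above, here is a minimal plain-Java sketch that turns one topic's per-word weights into a probability distribution. It assumes you already have a non-negative weight for every vocabulary word in that topic; `TopicNormalizer`, `normalize`, and the optional smoothing parameter `beta` are hypothetical names introduced here, not part of Mallet.

```java
// Hypothetical helper; not part of Mallet.
public class TopicNormalizer {

    // Divide each (optionally smoothed) weight by the total so the result sums to one.
    // Assumption: weights covers the whole vocabulary for a single topic.
    static double[] normalize(double[] weights, double beta) {
        double total = 0.0;
        for (double w : weights) {
            if (w < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += w + beta;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must not all be zero");
        }
        double[] probs = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            probs[i] = (weights[i] + beta) / total;
        }
        return probs;
    }

    public static void main(String[] args) {
        // Made-up weights for a three-word vocabulary, no smoothing.
        double[] probs = normalize(new double[] {6.0, 3.0, 1.0}, 0.0);
        double sum = 0.0;
        for (double p : probs) {
            sum += p;
        }
        System.out.println(java.util.Arrays.toString(probs) + " sum=" + sum);
    }
}
```

Note that normalizing only the top-N words printed for a topic gives a distribution over those N words, not over the vocabulary; whether that is acceptable depends on what the distribution is used for.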
0 votes, 1 answer

Generating documents from LDA topic model

I'm learning a topic model from a set of documents and that's working well. But I'm wondering if any existing system will actually generate new documents from the topics and words in the model. I.e., say I want a new document of topic 0, will any of…
ten
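To make the idea in the question above concrete, here is a minimal plain-Java sketch of drawing a "document" from a single topic by sampling words from that topic's word distribution. It is not an existing system or a Mallet feature: `TopicDocumentSampler`, `sampleIndex`, and `generate` are hypothetical names, and the vocabulary and probabilities are made up. A full LDA generative step would also draw a topic mixture per document first; this sketch fixes one topic for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not a Mallet class.
public class TopicDocumentSampler {

    // Draw one index from a discrete distribution by walking the cumulative sum.
    static int sampleIndex(double[] probs, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (u < cumulative) {
                return i;
            }
        }
        return probs.length - 1; // guard against floating-point rounding
    }

    // Sample `length` words independently from one topic's word distribution.
    // Assumption: topicWordProbs sums to one and is aligned with vocab.
    static List<String> generate(String[] vocab, double[] topicWordProbs, int length, long seed) {
        Random rng = new Random(seed);
        List<String> words = new ArrayList<>(length);
        for (int i = 0; i < length; i++) {
            words.add(vocab[sampleIndex(topicWordProbs, rng)]);
        }
        return words;
    }

    public static void main(String[] args) {
        String[] vocab = {"goal", "match", "team", "vote"};
        double[] topic0 = {0.4, 0.3, 0.25, 0.05}; // made-up distribution for "topic 0"
        System.out.println(String.join(" ", generate(vocab, topic0, 10, 1L)));
    }
}
```

The output is a bag of words with no syntax, since the model only describes word frequencies per topic; fixing the seed makes the sample reproducible.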
0 votes, 1 answer

How to find the number of documents (and fraction) per topic using LDA?

I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all tweets in a file where each line (one tweet) is treated as a document. I used this file as the input file for the Mallet API. public static…
Khaled
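One possible reading of "documents per topic" is to assign each document to its highest-probability topic and then count. The plain-Java sketch below does that on a made-up document-topic matrix; it assumes you have already extracted per-document topic proportions from the trained model. `TopicDocCounts`, `countByDominantTopic`, and `fractions` are hypothetical names, not Mallet API.

```java
import java.util.Arrays;

// Hypothetical sketch; not a Mallet class.
public class TopicDocCounts {

    // Count how many documents have each topic as their highest-probability topic.
    // docTopics[d][k] is assumed to be the proportion of topic k in document d.
    static int[] countByDominantTopic(double[][] docTopics, int numTopics) {
        int[] counts = new int[numTopics];
        for (double[] proportions : docTopics) {
            int best = 0;
            for (int k = 1; k < numTopics; k++) {
                if (proportions[k] > proportions[best]) {
                    best = k; // ties keep the lower topic index
                }
            }
            counts[best]++;
        }
        return counts;
    }

    // Convert counts to fractions of the total number of documents.
    static double[] fractions(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] result = new double[counts.length];
        for (int k = 0; k < counts.length; k++) {
            result[k] = total == 0 ? 0.0 : (double) counts[k] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] docTopics = {
            {0.7, 0.3}, {0.2, 0.8}, {0.6, 0.4}, {0.9, 0.1} // four made-up documents, two topics
        };
        int[] counts = countByDominantTopic(docTopics, 2);
        // prints [3, 1] [0.75, 0.25]
        System.out.println(Arrays.toString(counts) + " " + Arrays.toString(fractions(counts)));
    }
}
```

For short texts such as tweets, a dominant-topic assignment is a simplification; summing the proportions per topic instead would give a "soft" count.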
0 votes, 1 answer

Using topic modeling Java toolkit

I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is in the format of (keyword, weight) pairs extracted from the news. I saw two Java toolkits:…
S_M
0 votes, 2 answers

MALLET Java API Importing Data

I am trying to do topic modeling with the Java API. There is a handy example provided with the package. However, given the much larger size of my data, I think it would be impractical to import it all from one file. I looked at the PowerPoint…
pjshap
0 votes, 1 answer

Wrong input arguments for MALLET topic modeling?

I tried to run MALLET topic modeling using the following via command line: bin/mallet import-dir --input data\my_text \ --output my.mallet \ --remove-stopwords TRUE \ --keep-sequence TRUE \ …
wnk
0 votes, 1 answer

Keep digits in Mallet topic modeling

I am using Mallet for topic modeling. A large number of words in my input text include both letters and digits; e.g., A54, D892. I just noticed that Mallet automatically removes the digits and only keeps the letters in the words. I do not even use…
SM.
0 votes, 0 answers

How to import and use feature vectors in MALLET's topic modelling?

I am using MALLET's topic modelling. I have a set of keywords along with weights for a set of documents. I want to train a model on them and use it to infer topics for new documents. Note: each keyword of the document has a weight assigned to it, which is similar to…
sravan_kumar
0 votes, 1 answer

How to add new documents to existing topic model in mallet or batch the model for large document counts

I want to use topic modeling and found MALLET suitable for me. I successfully created my first demo using some 0.1 million documents. Now, per my requirements, I have to deal with 10 million documents, which I am not able to process. Is…
Hardik Dobariya