Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From Mallet's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
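The "pipes" idea described above can be illustrated with a small sketch. This is not MALLET's Java API: the function names (`tokenize`, `remove_stopwords`, `to_counts`, `run_pipeline`) and the tiny stoplist are hypothetical, chosen only to show how distinct steps chain into a text-to-counts transformation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stoplist for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOPWORDS]

def to_counts(tokens):
    """Convert a token sequence into a bag-of-words count vector."""
    return Counter(tokens)

def run_pipeline(data, pipes):
    """Apply each pipe in order, feeding its output to the next one."""
    for pipe in pipes:
        data = pipe(data)
    return data

counts = run_pipeline("The cat and the hat", [tokenize, remove_stopwords, to_counts])
print(counts)
```

Each step does one job and can be swapped or reordered, which is the design point the description of pipes is making.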

321 questions
4
votes
0 answers

How to get the probability distribution of words for a particular topic?

I am doing topic modelling using Mallet and everything works fine except that I am unable to get the probability distribution of the words in any particular topic. However, I am using the below code to print the topic proportions for any particular…
London guy
  • 27,522
  • 44
  • 121
  • 179
4
votes
2 answers

Passing Python strings to Mallet for topic modelling

I'm building a corpus of texts harvested alongside some metadata from HTML with BeautifulSoup. It would be really helpful if I could call Mallet from within Python, and have it model topics from Python strings, rather than from text files in a…
user2437842
  • 139
  • 1
  • 10
4
votes
1 answer

Topic Modelling mallet: how to interpret the Kullback-Leibler divergence

After obtaining various probability distributions from various documents in mallet, I have applied the following code to calculate the KL divergence between the first and the second document: Maths.klDivergence(double[] d1,double[] d2); How…
user3318618
  • 73
  • 1
  • 9
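For readers interpreting the KL divergence value asked about above, here is a minimal sketch of the standard definition, D(p || q) = sum of p_i * log(p_i / q_i). It is a plain-Python illustration, not MALLET's `Maths.klDivergence` implementation; the function name `kl_divergence` and the `base` parameter are assumptions, and the logarithm base MALLET uses should be checked in its source.

```python
import math

def kl_divergence(p, q, base=math.e):
    """Return D(p || q) for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi, base)
    return total

doc1 = [0.9, 0.1]  # hypothetical topic proportions for document 1
doc2 = [0.5, 0.5]  # hypothetical topic proportions for document 2
print(kl_divergence(doc1, doc2), kl_divergence(doc2, doc1))
```

The value is 0 when the two distributions are identical and grows as they differ. It is not symmetric, so swapping the two documents generally gives a different number.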
4
votes
2 answers

how to get word-topic probability using mallet

I've made a parallel topic model using mallet. And I want to get top-words for each document. To do that, I'm trying to get a word-topic probability matrix. How would I achieve this?
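One common way to turn word-topic counts into probabilities is to smooth and normalize each topic's row of counts. The sketch below assumes you have already extracted a topics-by-vocabulary count matrix from your model; the function name `topic_word_probabilities`, the input layout, and the `beta` value are all assumptions for illustration, not MALLET API.

```python
def topic_word_probabilities(counts, beta=0.01):
    """Estimate P(word | topic) from a topics x vocabulary count matrix.

    Each row is smoothed with beta and normalized:
        (n_kw + beta) / (n_k + V * beta)
    where n_kw is the count of word w in topic k, n_k is the row total,
    and V is the vocabulary size.
    """
    vocab_size = len(counts[0])
    probs = []
    for row in counts:
        denom = sum(row) + beta * vocab_size
        probs.append([(c + beta) / denom for c in row])
    return probs

# Hypothetical counts: 2 topics over a 3-word vocabulary.
counts = [[8, 1, 1],
          [0, 5, 5]]
phi = topic_word_probabilities(counts)
for k, row in enumerate(phi):
    print(k, [round(p, 3) for p in row])
```

Sorting each row in descending order then gives the top words for that topic.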
4
votes
1 answer

Dealing with integer-valued features for CRF in mallet

I am just starting to use the SimpleTagger class in mallet. My impression is that it expects binary features. The model that I want to implement has positive integer-valued features and I wonder how to implement this in mallet. Also, I heard that…
Nick
  • 2,924
  • 4
  • 36
  • 43
3
votes
3 answers

Topic Modeling using Mallet Java Api?

Hi, I have to do topic modeling using the Mallet Java API, but I am new to coding, so I am finding it really difficult to understand the Java libraries and use them. Does anyone have some sample code where they do topic modeling using the API which can be…
Yogesh Sharma
  • 61
  • 3
  • 5
3
votes
1 answer

PyLDAvis visualisation does not align with generated topics

I am using PyLDAvis to visualise the results of the LDA from Mallet. Before I can do that, I need the wrapper of the gensim library: model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model_list[8]) When I print the found topics, they…
gython
  • 865
  • 4
  • 18
3
votes
0 answers

java.lang.NegativeArraySizeException When Making Document Topics Matrix using RMallet

I'm trying to write some code to get a Mallet Instance List file into a document topics matrix in R. To do this, I read the instance list file into a topic trainer variable called 'topic.model'. Below is the function call I am making to create a…
mootechs
  • 41
  • 1
3
votes
1 answer

LDA Mallet CalledProcessError

I am trying to implement the following code: import os os.environ.update({'MALLET_HOME':r'c:/mallet-2.0.8/'}) mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet' ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow, num_topics=20,…
3
votes
3 answers

python mallet LDA FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\abc\\AppData\\Local\\Temp\\d33563_state.mallet.gz'

This is my first time using Mallet LDA. Basically, I downloaded the mallet-2.0.8 zip file and the JDK. I installed the JDK and extracted mallet-2.0.8 to a destination folder. I set MALLET_HOME. Here is my…
yoyo
  • 31
  • 2
  • 6
3
votes
2 answers

Mallet Topic Modelling API - How to decide number of intervals needed or best for optimization?

Sorry, I'm quite a beginner in the field of NLP. As the title says, what is the best interval for optimization in the Mallet API? I was also wondering whether it depends on or is related to the number of iterations, topics, corpus size, etc.
3
votes
1 answer

Error in Mallet Java

I want to do topic modelling, so I ran the command below: bin\mallet train-topics --input web.mallet --output-state output-file.gz It tells me: Topic modeling currently only supports feature sequences: use --keep-sequence option when…
shahrukh
  • 73
  • 5
3
votes
1 answer

How to get topic vector of new documents and compare with pre-defined topic model in Mallet?

I'm trying to compare a single document's topic distribution (using LDA) with other files and their topic distributions within a previously created topic model, using MALLET. I know that this can be done through MALLET commands in terminal…
higz555
  • 115
  • 8
3
votes
0 answers

Mallet LDA ArrayIndexOutOfBoundsException while training the model

I am trying to build a model with 500 or 1000 topics on a 1M document dataset with Mallet LDA. After 60 iterations I am getting an ArrayIndexOutOfBoundsException. The error message is as below: <60> LL/token: -7.64386 overflow on type…
ak.
  • 143
  • 9
3
votes
1 answer

How to use --use-ngrams in mallet

I want to run mallet using the --use-ngrams true option but can't seem to get it working. bin\mallet import-file --input ovary.txt --output ovary2.mallet --keep-sequence-bigrams --remove-stopwords bin\mallet train-topics --input ovary2.mallet…
Ali N
  • 31
  • 2