Questions tagged [mallet]

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

From MALLET's website:

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.
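As a rough illustration of the "pipes" idea described above, here is a minimal sketch in plain Python. It does not use MALLET's API: the function names, the tiny stopword list, and the `run_pipes` helper are all hypothetical and exist only to show how distinct steps (tokenizing, removing stopwords, counting) can be chained.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not MALLET's built-in list).
STOPWORDS = {"the", "a", "of", "and", "is"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def to_counts(tokens):
    """Convert a token sequence into a token -> count mapping."""
    return Counter(tokens)

def run_pipes(text, pipes):
    """Pass the data through each pipe in order, feeding each output to the next."""
    data = text
    for pipe in pipes:
        data = pipe(data)
    return data

PIPES = [tokenize, remove_stopwords, to_counts]
counts = run_pipes("The cat and the hat of the cat", PIPES)
print(counts)
```

Each step only needs to accept the previous step's output, so steps can be reordered or swapped without touching the others. In MALLET itself the corresponding steps are Java classes configured through its import commands, so this sketch should be read as a conceptual analogy only.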

321 questions
1 vote, 1 answer

infinity value error after token regex command

I'm trying to use the option --token-regex '[\p{L}\p{M}]+' together with the usual commands for importing text, so that Mallet can read German text. No error message is shown and a new file is created. It is suspiciously small, however. Then, using…
blub123
1 vote, 1 answer

Using pre-defined topics in Mallet

I'm looking to use Mallet to classify different documents by topics that I have defined. I know that Mallet will first determine the topics, then classify the documents but I want to skip the first step because I already have a list of topics with…
NLP
1 vote, 1 answer

Truncate tokens for a topic model in MALLET

I want to truncate all tokens in a corpus to have a maximum length of 5 characters. Is there a way to set the --token-regex import option in MALLET to accomplish this? The code I'm currently using to import documents is this: mallet-2.0.7/bin/mallet…
Jim
1 vote, 0 answers

Conditional random field, concept and terminology clarification needed: markov order, transition, connectivity

I am using Mallet to work with Conditional Random Fields. From my understanding, a CRF has a few kinds of Markov order depending on how nodes are connected. In the figure, they are the three-quarter order, the first order, and the second order from the top. …
pandagrammer
1 vote, 0 answers

Getting word-topic probabilities with mallet

I am using mallet through the terminal. I have imported the training data in a single file format: project5 TokenNameCOMMENT This is the actual text and I have used the train-topics command to create topic models of several sizes. What I want to…
Nikos
1 vote, 1 answer

MALLET Ranking of Words in a topic

I am relatively new to Mallet and need to know: are the words in each topic that Mallet produces rank-ordered in some way? If so, what is the ordering (i.e., is the first word in a topic list the one with the highest distribution across the corpus)? Thanks!
1 vote, 2 answers

Mallet works in Linux but not Windows

OK, I'm trying to use Mallet to classify some documents in Windows. I've achieved it in Linux, but I just can't get it to do the job in Windows (the target environment). I've imported the data into a .mallet file and then created a classifier using this input…
bendecko
1 vote, 1 answer

How do you view the labeling of the test set with GenericAcrfTui from the command line?

I am training and testing data using Mallet's GenericAcrfTui. So I am using the Graphical Models in Mallet (GRMM) to do CRF training. I have created features for both my training set and my test set and was hoping to run GenericAcrfTui from the…
demongolem
1 vote, 1 answer

how to get probability of each topic in mallet

I am doing topic modelling with Mallet. I have imported my file (one document per line) and trained Mallet with 200 topics. Now I have 200 topics, each with its related words. Now I need to know each topic's probability. How can I…
1 vote, 1 answer

Mallet Feature Selection similar to setting feature values to 0

I'm looking at the Mallet source code, and it seems that most of the classifier implementations (e.g. Naive Bayes) don't really take feature selection into account, even though the InstanceList class has a setFeatureSelection method. Now I want…
goh
1 vote, 2 answers

Mallet doesn't work even for help command

I am going to use Mallet for topic modelling and I am using Linux. I have installed Mallet (but not Ant) and I have Java on my system. When I try to use Mallet commands, they don't work. The only command that works is: bin/mallet, which gives me a…
1 vote, 0 answers

Mallet and word stems

I'm using Mallet for a text classification task, and it seems that Mallet is applying some word stemming algorithm to my text. How can I configure Mallet to avoid using this feature?
mac2bua
1 vote, 1 answer

MALLET for automatic topic tagging - with training data

I have a corpus of documents, which I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been tagged with one or more tags, and a short title. (I also have a much larger list of titles - which…
swami
1 vote, 0 answers

The right Mallet class for a Topic Model

I'm working with the Mallet library for a project in Java. I have 15,000 documents with 400 tokens each. I tried using ParallelTopicModel. But I would like to have a set of topics that contain both single tokens and sequences of tokens (e.g. "Java"…
0 votes, 0 answers

Topic Modelling with LDA Gensim (3.8.3) Python - Problem with LdaMallet attribute

I'm new to Python and I'm having trouble with a function that has been discussed many times already, most recently here: Extract Topic Scores for Documents LDA Gensim Python problem of sorting tuples I've done what's suggested in the answer: def…
nazeli