Passing Python strings to Mallet for topic modelling

Question

I'm building a corpus of texts harvested alongside some metadata from HTML with BeautifulSoup. It would be really helpful if I could call Mallet from within Python, and have it model topics from Python strings, rather than from text files in a directory. That way I could put the n keywords located by Mallet into each file.

I get a message saying that Mallet has been recognised when I run:

from nltk.classify import mallet
from subprocess import call
mallet.config_mallet("malletdir/mallet-2.0.7/bin")

But I haven't had any luck with the next steps, and am not even sure if Mallet accepts anything other than saved files.

I have not been able to turn up any documentation that I can really understand. Has anybody seen digestible documentation for this? (The NLTK book doesn't get into Mallet.) I would also be happy to learn of any other means of topic modelling within Python that I could operationalise without a really deep knowledge of Python.

Sorry, this is my first rodeo.

score 2 · Answer 1 · answered Dec 03 '14 at 00:09

In case you are still looking for a solution: Gensim (a Python topic modeling/machine learning package) has a wrapper for Mallet which is easy to use and well documented. Here are some Gensim tutorials and a specific tutorial for the Mallet wrapper. You may also want to read some installation instructions (mostly the part about setting Java memory) here and then you'd be ready to go.
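
If you would rather see the moving parts first, here is a minimal standard-library sketch of passing Python strings to Mallet without a directory of text files. It relies on Mallet's import-file command, which reads a single file with one document per line in the form "name label text". The function names, the "docN" names, the "none" label, and the output file names are my own placeholders, and the topic-keys parser assumes that file is tab-separated with the keywords in the third column. Treat it as a starting point rather than a tested pipeline.

```python
import subprocess
from pathlib import Path


def write_mallet_input(texts, path):
    """Write one document per line as '<name> <label> <text>'."""
    with open(path, "w", encoding="utf-8") as handle:
        for index, text in enumerate(texts):
            # Mallet reads one document per line, so collapse newlines and extra spaces.
            one_line = " ".join(text.split())
            # "docN" and "none" are placeholder name and label values (assumptions).
            handle.write(f"doc{index} none {one_line}\n")


def mallet_commands(mallet_bin, input_txt, workdir, num_topics=10):
    """Build the import-file and train-topics command lines without running them."""
    workdir = Path(workdir)
    corpus = workdir / "corpus.mallet"
    import_cmd = [
        str(mallet_bin), "import-file",
        "--input", str(input_txt),
        "--output", str(corpus),
        "--keep-sequence",      # train-topics needs word order preserved
        "--remove-stopwords",
    ]
    train_cmd = [
        str(mallet_bin), "train-topics",
        "--input", str(corpus),
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(workdir / "topic_keys.txt"),
        "--output-doc-topics", str(workdir / "doc_topics.txt"),
    ]
    return import_cmd, train_cmd


def read_topic_keys(path, n_keywords=10):
    """Return the first n keywords per topic (assumes tab-separated: id, weight, words)."""
    topics = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip lines that do not match the assumed layout
            topics.append(fields[2].split()[:n_keywords])
    return topics


def run_mallet(texts, mallet_bin, workdir, num_topics=10, n_keywords=10):
    """Write the strings to one file, run Mallet twice, and return keywords per topic."""
    workdir = Path(workdir)
    input_txt = workdir / "input.txt"
    write_mallet_input(texts, input_txt)
    for command in mallet_commands(mallet_bin, input_txt, workdir, num_topics):
        subprocess.run(command, check=True)  # raises if Mallet exits with an error
    return read_topic_keys(workdir / "topic_keys.txt", n_keywords)
```

With the path from the question, a call would look something like run_mallet(my_strings, "malletdir/mallet-2.0.7/bin/mallet", "some_output_dir"), where my_strings is the list of texts pulled out with BeautifulSoup and the output directory already exists.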

score 1 · Answer 2 · answered Mar 18 '14 at 13:57

I once tried implementing Mallet with an NLTK project and I too ran into dead end after dead end. I think the main thing to keep in mind here is that Mallet is Java-based while NLTK is written in Python.

You already knew that, but my point is that I personally struggled with mixing the technologies because I do not have a strong background in Java. I've received the same feedback from coworkers about Mallet with Python: "Be ready to spend a lot of time debugging."

Since then I've been using the sklearn library for Python. It is aimed at machine learning more generally rather than at NLP specifically, but it can be used for NLP quite nicely. It comes with a very large selection of modelling tools, and most of it seems to rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented.
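
To give a rough idea of what topic extraction from plain Python strings looks like in sklearn, here is a small sketch using TF-IDF features and NMF. The function name, the parameter defaults, and the choice of NMF over other models are just my assumptions for illustration; adjust them to your corpus.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def top_keywords_per_topic(texts, n_topics=2, n_keywords=3):
    """Fit NMF on TF-IDF features and return the top words for each topic."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)  # rows are documents, columns are words

    model = NMF(n_components=n_topics, random_state=0)
    doc_topic = model.fit_transform(tfidf)  # rows are documents, columns are topics

    # Newer scikit-learn releases use get_feature_names_out(); older ones use get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        vocab = vectorizer.get_feature_names_out()
    else:
        vocab = vectorizer.get_feature_names()

    topics = []
    for weights in model.components_:
        best = weights.argsort()[::-1][:n_keywords]  # indices of the largest weights
        topics.append([vocab[i] for i in best])
    return topics, doc_topic
```

The input is just a list of strings, so nothing has to be saved to disk. The second return value has one row per document, so the strongest topic for document i is doc_topic[i].argmax(), and that topic's keywords can then be stored alongside the document.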

I don't want to discourage you from using Mallet, especially just because I said so. But if you are open to alternatives, I think you will find that when building projects with NLTK it's far easier to use Python modules, since NLTK itself is written in Python. I hope this helps!

Yes, I read up a little on sklearn, but getting things working is just as bad, at present---I'm on OSX Lion, and I can't get Scipy or matplotlib to install properly! Assuming I can, does this method seem about right to you: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html ? Sorry to bug you---this is a fairly minor component of a thesis, overall, and I'd prefer not to head down too many dead ends. — user2437842, Mar 19 '14 at 02:06
