I'm building a corpus of texts, harvested along with some metadata from HTML using BeautifulSoup. It would be really helpful if I could call Mallet from within Python and have it model topics from Python strings rather than from text files in a directory. That way I could put the n keywords Mallet finds for each text into the corresponding file.
I get a message saying that Mallet has been recognised when I run:
from nltk.classify import mallet
from subprocess import call  # imported for the next steps, not used yet
# Point NLTK at my local Mallet installation
mallet.config_mallet("malletdir/mallet-2.0.7/bin")
But I haven't had any luck with the next steps, and am not even sure if Mallet accepts anything other than saved files.
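To make the goal concrete, here is a minimal sketch of the kind of workaround I have in mind, in case Mallet really does need files. It assumes Mallet's command-line `import-file` reads one document per line in the form "name label text", and that `train-topics` can write topic keywords and per-document topics to files. The function names, the dummy label "X", and the output file names are my own inventions for illustration, and I have not confirmed that this is the recommended route.

```python
import os
import re


def write_mallet_input(docs, path):
    """Write a {name: text} dict to `path`, one document per line.

    Assumed line format: name, label, text (tab separated).
    Names must not contain whitespace.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for name, text in docs.items():
            # Collapse newlines and runs of spaces so each text stays on one line.
            flat = re.sub(r"\s+", " ", text).strip()
            # "X" is a placeholder label (hypothetical; topic modelling does not use it).
            handle.write(f"{name}\tX\t{flat}\n")


def build_mallet_commands(mallet_bin, input_path, work_dir, num_topics=10):
    """Return (import_cmd, train_cmd) as argument lists for subprocess.

    `mallet_bin` is the path to the Mallet launcher script (an assumption
    about the local install). Output file names are made up for this sketch.
    """
    corpus = os.path.join(work_dir, "corpus.mallet")
    import_cmd = [
        mallet_bin, "import-file",
        "--input", input_path,
        "--output", corpus,
        "--keep-sequence",      # needed for topic modelling
        "--remove-stopwords",
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", corpus,
        "--num-topics", str(num_topics),
        "--output-topic-keys", os.path.join(work_dir, "topic_keys.txt"),
        "--output-doc-topics", os.path.join(work_dir, "doc_topics.txt"),
    ]
    return import_cmd, train_cmd
```

If that is roughly right, I would expect to write the strings out with `write_mallet_input(docs, "input.txt")`, run each returned command with `subprocess.check_call(...)`, and then read the keywords back from `topic_keys.txt`. The path to the launcher would presumably be something like "malletdir/mallet-2.0.7/bin/mallet" on my machine, but that is a guess.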
I have not been able to turn up any documentation that I can really understand. Has anybody seen digestible documentation for this? (The NLTK book doesn't get into Mallet.) I would also be happy to learn of any other means of topic modelling within Python that I could operationalise without a really deep knowledge of Python.
Sorry, this is my first rodeo.