
I'm new to Mallet. I usually use Weka for classification, and now I'm trying to use Mallet for text classification. In Weka, we choose the attributes ourselves (such as word length or top-n word occurrences) and build the .arff file from them.

I have read about the input format for Mallet at http://mallet.cs.umass.edu/import.php but I'm still confused. How do we assign attributes in the input format? How do we tell Mallet that a document belongs to a certain class, for example the "sports" class?

An example of an input file would be much appreciated.

Thanks!

kaylak

1 Answer


How do we tell that a document belongs to a certain class?

You can have one folder per class, for example C:/Corpus/Class1, C:/Corpus/Class2, ..., C:/Corpus/Classn, where each folder contains the documents that belong to that class. When you import the directories, the folder name is used as the class label.
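To make the folder layout concrete, here is a minimal sketch that builds a tiny corpus. The root folder corpus, the class names sports and politics, the file names, and the sample sentences are all made up for illustration; only the one-folder-per-class idea comes from the answer.

```shell
#!/bin/sh
# Hypothetical corpus root and class names; replace them with your own.
mkdir -p corpus/sports corpus/politics

# Each file holds one document; the folder it sits in is its class.
printf 'The home team won the match in extra time.\n' > corpus/sports/doc1.txt
printf 'The striker scored twice before halftime.\n' > corpus/sports/doc2.txt
printf 'Parliament passed the budget bill on Tuesday.\n' > corpus/politics/doc1.txt

# Show the resulting layout: one directory per class.
find corpus -type f | sort
```

With this layout, pointing import-dir at the two class folders should give two labels, sports and politics.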

How do we assign attributes in the input format?

To see the file-import options, go to C:/mallet/bin and run mallet import-dir --help. The available import options will be displayed, for example --remove-stopwords and --gram-sizes.

Example code to import files:

bin/mallet import-dir --input C:/Corpus/* --output corpus.mallet --gram-sizes 1,2 --preserve-case
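If you would rather keep everything in one file, the import page linked in the question also describes a one-document-per-line format for import-file: the first token is the instance name, the second is the label, and the rest of the line is the text. A rough sketch follows; the file name corpus.txt, the document names, and the sentences are made up, and the import step only runs if bin/mallet exists relative to the current directory.

```shell
#!/bin/sh
# One document per line: <name> <label> <text...>
# All names, labels, and sentences here are illustrative only.
cat > corpus.txt <<'EOF'
doc1 sports The home team won the match in extra time.
doc2 sports The striker scored twice before halftime.
doc3 politics Parliament passed the budget bill on Tuesday.
EOF

# Import only when Mallet is available at the assumed location.
if [ -x bin/mallet ]; then
  bin/mallet import-file --input corpus.txt --output corpus.mallet
else
  echo "bin/mallet not found; wrote corpus.txt only"
fi
```

Either way the label is part of the input itself, so there is no separate attribute declaration as in an .arff file.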

AnaB
  • Thanks for your answer. So I suppose that Mallet's default attribute type is unigrams, where every word is an attribute? – kaylak Jul 15 '15 at 13:54
  • Exactly. You can add bigrams with --gram-sizes 1,2, for example. – AnaB Jul 16 '15 at 16:04
  • Great! Your answer really helped me understand Mallet for classification :) I've now used Mallet for my research. – kaylak Jul 18 '15 at 18:32