I have documents arranged in folders, where each folder represents a class (category). For a new input (such as a question someone asks), I have to identify its category. What is the best way to do this using MALLET? I've gone through multiple articles about this but couldn't find a way.

Also, do I need to do sequence tagging on the input text?

1 Answer

  1. First, you need to train a model from the documents arranged in folders. For Mallet, each folder contains one or more documents, and each folder represents the class of the documents inside it.

Once you have your training documents, you need to convert them into a file that Mallet can read. Go to the bin folder of Mallet and enter a command like the following on the command line:

    mallet import-dir --input directory:\...\parentfolder\* --preserve-case --remove-stopwords --binary-features --gram-sizes 1 --output directory:\mallet-file-name

This is just an example. To see the full list of options for this command, type:

    mallet import-dir --help
  2. Once you have created this Mallet file, train a model with a command such as the following, where algorithmname is one of Mallet's trainers (for example NaiveBayes or MaxEnt):

    mallet train-classifier --trainer algorithmname --input directory:\mallet-file-name --output-classifier directory:...\model

Now that the model is created, you can use it to classify a document of unknown class. Here, classifier stands for the model file saved in the previous step:

    mallet classify-file --input directory:\...\data --output - --classifier classifier

This prints the classification of the document named data to standard output.
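The steps above can be strung together as a rough end-to-end sketch. Everything concrete in it is an assumption made for illustration: the folder name `corpus`, the two categories `billing` and `shipping`, the file names `train.mallet`, `model.classifier` and `question.txt`, the choice of the `MaxEnt` trainer, and a POSIX shell with the `mallet` launcher on the PATH. Adapt the paths to your own setup (on Windows, run the equivalent commands from Mallet's bin folder).

```shell
#!/bin/sh
# Sketch only: builds a toy corpus (hypothetical names) and runs the three
# Mallet commands from the answer on it.
set -e

CORPUS=corpus   # assumed parent folder; each sub-folder is one category

# One sub-folder per category, one or more documents in each.
mkdir -p "$CORPUS/billing" "$CORPUS/shipping"
printf 'Why was my card charged twice this month?\n' > "$CORPUS/billing/q1.txt"
printf 'How do I get a refund for an invoice?\n'     > "$CORPUS/billing/q2.txt"
printf 'When will my parcel be delivered?\n'         > "$CORPUS/shipping/q1.txt"
printf 'Can I change the delivery address?\n'        > "$CORPUS/shipping/q2.txt"

# The new, unlabeled input whose category we want to find.
printf 'My package has not arrived yet.\n' > question.txt

# Run the Mallet steps only if the launcher is available on PATH.
if command -v mallet >/dev/null 2>&1; then
    # 1. Convert the folder-per-class corpus into a Mallet file.
    mallet import-dir --input "$CORPUS"/* --remove-stopwords --output train.mallet
    # 2. Train a classifier and save it to disk.
    mallet train-classifier --trainer MaxEnt --input train.mallet \
        --output-classifier model.classifier
    # 3. Classify the new document; "--output -" writes to standard output.
    mallet classify-file --input question.txt --output - \
        --classifier model.classifier
else
    echo "mallet not found on PATH; toy corpus created in ./$CORPUS" >&2
fi
```

A corpus this small is only useful for checking that the commands run; a real model needs many more documents per category.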

Whether you need sequence tagging depends on the data that you are trying to classify.

Rushdi Shams