I am trying to build a text classifier using Mallet. The data set is fairly large, so I am looking for a way, if possible, to run the "import" task on multiple threads, because loading is taking a long time. A few questions:

  1. Is there a way to manually parallelize the process by dividing the data, importing each part separately, and then joining the results? I know I can run the imports in parallel and get multiple Mallet input files, but can I combine those files before training the classifier?

  2. Does Mallet itself parallelize this process if there are threads available on the machine?

Thanks for your help!

baker

1 Answer

Actually, your questions don't seem to be directly related to Mallet. To answer your second question: Mallet doesn't do this. But you can split the text into equal parts, keep all the parts in the same folder, and give Mallet the path to that folder. This link can help you achieve it. You need to follow the instructions in the "One instance per file" part.

  • I am doing what you mentioned at this time, but what happens is that only one file in the folder is processed at any given point. What I am looking for is to make the process parallel. Splitting the data into different parts will not allow parallel processing. – baker Apr 12 '17 at 20:19
  • I am mainly looking to parallelize the "import data" step. I am not having a problem with the loading step itself, but it is taking a lot of time. – baker Apr 12 '17 at 20:26
  • Maybe you should clarify the purpose of your project; then I may be able to help more. From my point of view, there is no need to parallelize the import process, since you can split the data into parts. – Aaron Clifton Apr 20 '17 at 16:10
  • I am trying to classify docs into three groups, so I have a 'home' directory which contains 3 sub-directories (group1, group2, group3). My data is split into these 3 sub-directories based on the class label of each doc. Each of these sub-directories has a large number of relatively big files. To build the classifier I have to import the data first, and I provide the home directory when running the import command. For this importing step, is there a way to make it parallel (beyond parallelizing across the three sub-directories) while preserving the labels of the files? – baker Apr 21 '17 at 21:09
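
For the layout described in the last comment (a home directory with one sub-directory per class label), the manual splitting step can be sketched in a way that keeps every file tied to its label. This is only a minimal sketch under assumptions: the function name `shard_labeled_files` and the round-robin strategy are hypothetical and not part of Mallet, and the sketch does not address whether or how the resulting per-shard Mallet files can be combined before training, which this thread leaves open.

```python
from pathlib import Path
from typing import List, Tuple


def shard_labeled_files(home: Path, num_shards: int) -> List[List[Tuple[str, Path]]]:
    """Split the files under home/<label>/ into num_shards shards.

    Hypothetical helper, not a Mallet API. Each entry is a (label, path)
    pair, where the label is the name of the sub-directory the file lives
    in, so the class label survives the split.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")

    shards: List[List[Tuple[str, Path]]] = [[] for _ in range(num_shards)]
    index = 0
    # Walk the label directories in a stable order so the split is reproducible.
    for label_dir in sorted(p for p in home.iterdir() if p.is_dir()):
        for doc in sorted(p for p in label_dir.iterdir() if p.is_file()):
            # Round-robin assignment keeps shard sizes within one file of each other.
            shards[index % num_shards].append((label_dir.name, doc))
            index += 1
    return shards
```

Each shard could then be laid out as its own home-style directory (for example group1, group2, group3 inside a per-shard folder) and imported by a separate process. Round-robin assignment balances the number of files per shard, not their sizes, so with very uneven file sizes a size-aware split may work better.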