I am using the Java API of Weka to classify documents according to different textual features. With the TextDirectoryLoader class I can load a directory of txt files, transform the text into numerical features, and classify the instances later on. The problem is that the text of each file is represented by a single String attribute in the dataset.
For instance:
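import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File(dataDir));
Instances dataRaw = loader.getDataSet();
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
Instances dataFiltered = Filter.useFilter(dataRaw, filter);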
'dataRaw' will contain one attribute that is the text and one attribute that is the class (derived from the directory taxonomy):
System.out.println(dataRaw.numAttributes()); // outputs 2
Is it possible to split the text within each original txt file (for instance, with delimiters), so that several textual attributes are loaded instead of one?
One option would be to insert some new attributes afterwards, e.g.:
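// A null value list creates a String attribute; it is appended as the last attribute.
dataRaw.insertAttributeAt(new Attribute("attr2", (FastVector) null), dataRaw.numAttributes());
int attr2Index = dataRaw.numAttributes() - 1;
for (int i = 0; i < dataRaw.numInstances(); i++) {
    dataRaw.instance(i).setValue(attr2Index, "sometext");
}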
Or create an arff file like:
@relation whatever
@attribute attr1 String
@attribute attr2 String
...
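To make the ARFF option concrete, here is a minimal sketch that builds such a file with plain Java and no Weka calls. It assumes each txt file separates its sections with a delimiter; the `###` marker, the class name `SectionedArffWriter`, the attribute names and the sample documents are all hypothetical, and the quoting is a simplified stand-in for whatever Weka itself would write:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SectionedArffWriter {

    // Hypothetical marker that separates the sections inside one txt file.
    static final String DELIMITER = "###";

    /** Splits one document into exactly `parts` sections; missing sections become empty strings. */
    static List<String> splitSections(String text, int parts) {
        String[] raw = text.split(Pattern.quote(DELIMITER), parts);
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            sections.add(i < raw.length ? raw[i].trim() : "");
        }
        return sections;
    }

    /** Wraps a value in single quotes and backslash-escapes a few special characters (simplified). */
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\")
                      .replace("'", "\\'")
                      .replace("\n", "\\n")
                      .replace("\r", "\\r")
                      .replace("\t", "\\t") + "'";
    }

    /**
     * Builds ARFF text with one string attribute per section plus a nominal class.
     * Each entry of `docs` is {classLabel, rawText}.
     */
    static String toArff(String relation, List<String> labels, List<String[]> docs, int parts) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 1; i <= parts; i++) {
            sb.append("@attribute attr").append(i).append(" string\n");
        }
        List<String> quotedLabels = new ArrayList<>();
        for (String label : labels) {
            quotedLabels.add(quote(label));
        }
        sb.append("@attribute class {").append(String.join(",", quotedLabels)).append("}\n\n");
        sb.append("@data\n");
        for (String[] doc : docs) {
            for (String section : splitSections(doc[1], parts)) {
                sb.append(quote(section)).append(',');
            }
            sb.append(quote(doc[0])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample documents; in practice the text would be read from the txt files.
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"pos", "Great phone###The battery lasts two days."});
        docs.add(new String[] {"neg", "Don't buy###Broke after a week."});
        System.out.print(toArff("whatever", Arrays.asList("pos", "neg"), docs, 2));
    }
}
```

The resulting text could be saved as an .arff file and loaded in place of the TextDirectoryLoader output, but that bypasses the loader rather than configuring it, which is why I am asking whether the loader can do this directly.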
Is there any way the above setting could be achieved via the 'TextDirectoryLoader'? Thanks in advance!