I am using the Java API of Weka to classify documents according to different textual features. With the TextDirectoryLoader class I can load a directory of txt files, transform the text into numerical features, and classify the instances later on. The problem is that the text of each file is represented by a single String attribute in the dataset.

For instance:

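import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Load every txt file under dataDir; each subdirectory name becomes a class label.
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File(dataDir));
Instances dataRaw = loader.getDataSet();

// Convert the single string attribute into word attributes.
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
Instances dataFiltered = Filter.useFilter(dataRaw, filter);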

'dataRaw' will contain one attribute that is the text and one attribute that is the class (derived from the directory taxonomy):

System.out.println(dataRaw.numAttributes()); // outputs 2

Is it possible to split the text inside the original txt files (for instance with delimiters), so that several textual attributes are loaded instead of one?

One option would be to insert some new attributes afterwards, e.g.:

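// Append a new string attribute (a null value list marks it as a string attribute).
dataRaw.insertAttributeAt(new Attribute("attr2", (FastVector) null), dataRaw.numAttributes());
int newIndex = dataRaw.numAttributes() - 1; // index of the attribute just added
for (int i = 0; i < dataRaw.numInstances(); i++) {
    dataRaw.instance(i).setValue(newIndex, "sometext");
}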

Or create an arff file like:

@relation whatever

@attribute attr1 String

@attribute attr2 String

...
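For either route, the text of each file first has to be cut into the parts that would fill attr1, attr2, and so on. A minimal sketch in plain Java, assuming a hypothetical delimiter `|||` between the sections of a file; the delimiter, the class name `DelimitedTextSplitter`, and the method `splitParts` are illustrative and not part of Weka:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimitedTextSplitter {
    // Hypothetical delimiter (an assumption); pick a marker that cannot occur in the real text.
    static final String DELIMITER = "|||";

    // Splits one file's content into trimmed parts, one per intended string attribute.
    static List<String> splitParts(String fileContent) {
        List<String> parts = new ArrayList<>();
        // Pattern.quote makes the delimiter literal; limit -1 keeps trailing empty parts.
        for (String part : fileContent.split(Pattern.quote(DELIMITER), -1)) {
            parts.add(part.trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = splitParts("title text ||| body text ||| footer text");
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("attr" + (i + 1) + " = " + parts.get(i));
        }
    }
}
```

Each returned part could then be written to its own string attribute, either with setValue as above or as one value in an ARFF data row.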

Is there any way the above setting could be achieved via the 'TextDirectoryLoader'? Thanks in advance!

KLaz

1 Answer

Once your files are loaded and a dataset of the form [textString, classLabel] has been created, you can process the string attribute with the StringToWordVector filter. It creates one attribute per word found in the strings, indicating whether that word is present in each instance: [word0, word1, ..., wordN, classLabel]. You can then process the resulting dataset further or directly run the task of your choice (clustering, classification, etc.).

To clarify, the filter decomposes your text string into a set of word presences, or word counts (frequencies) if its options are set accordingly (for example setOutputWordCounts(true)), which is a representation suitable for data mining tasks.
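To illustrate the idea of that decomposition, here is a minimal sketch in plain Java. It is not Weka's implementation: the class `WordCountSketch`, the method `countWords`, the lowercasing, and the split on non-word characters are all assumptions made for the example, and Weka's own tokenizer and options may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountSketch {
    // Maps each distinct word to the number of times it occurs in the text.
    // Lowercasing and splitting on non-word characters are illustrative choices.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each key would correspond to one word attribute; the value is its count.
        Map<String, Integer> counts = countWords("The cat saw the dog");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In such a representation every document becomes a row of numbers over the same word attributes, which is what a classifier or clusterer can work with.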

shirowww