How do I transform text into TF-IDF format using Weka in Java?

Question

Suppose I have the following sample ARFF file with two attributes:

(1) sentiment: positive [1] or negative [-1]

(2) tweet: text

@relation sentiment_analysis

@attribute sentiment {1, -1}
@attribute tweet string

@data
-1,'is upset that he can\'t update his Facebook by texting it... and might cry as a result  School today also. Blah!'
-1,'@Kenichan I dived many times for the ball. Managed to save 50\%  The rest go out of bounds'
-1,'my whole body feels itchy and like its on fire '
-1,'@nationwideclass no, it\'s not behaving at all. i\'m mad. why am i here? because I can\'t see you all over there. '
-1,'@Kwesidei not the whole crew '
-1,'Need a hug '
1,'@Cliff_Forster Yeah, that does work better than just waiting for it  In the end I just wonder if I have time to keep up a good blog.'
1,'Just woke up. Having no school is the best feeling ever '
1,'TheWDB.com - Very cool to hear old Walt interviews!  ? http://blip.fm/~8bmta'
1,'Are you ready for your MoJo Makeover? Ask me for details '
1,'Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur '
1,'happy #charitytuesday @theNSPCC @SparksCharity @SpeakingUpH4H '

I want to convert the values of the second attribute into their equivalent TF-IDF values.

I tried the following code, but in its output ARFF file the first attribute is missing for instances of the positive (1) class.

import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Set the tokenizer: unigrams only (min and max n-gram size are both 1)
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W");

// Set the filter; index 1 is the tweet attribute (this array is 0-based)
StringToWordVector filter = new StringToWordVector();
filter.setAttributeIndicesArray(new int[]{1});
filter.setOutputWordCounts(true);
filter.setTokenizer(tokenizer);
filter.setWordsToKeep(1000000);
filter.setDoNotOperateOnPerClassBasis(true);
filter.setLowerCaseTokens(true);
filter.setTFTransform(true);
filter.setIDFTransform(true);
// Call setInputFormat last, once all options have been set
filter.setInputFormat(inputInstances);

// Filter the input instances into the output ones
outputInstances = Filter.useFilter(inputInstances, filter);

Sample output ARFF file:

@data
{0 -1,320 1,367 1,374 1,397 1,482 1,537 1,553 1,681 1,831 1,1002 1,1033 1,1112 1,1119 1,1291 1,1582 1,1618 1,1787 1,1810 1,1816 1,1855 1,1939 1,1941 1}
{0 -1,72 1,194 1,436 1,502 1,740 1,891 1,935 1,1075 1,1256 1,1260 1,1388 1,1415 1,1579 1,1611 1,1818 2,1849 1,1853 1}
{0 -1,374 1,491 1,854 1,873 1,1120 1,1121 1,1197 1,1337 1,1399 1,2019 1}
{0 -1,240 1,359 2,369 1,407 1,447 1,454 1,553 1,1019 1,1075 3,1119 1,1240 1,1244 1,1373 1,1379 1,1417 1,1599 1,1628 1,1787 1,1824 1,2021 1,2075 1}
{0 -1,198 1,677 1,1379 1,1818 1,2019 1}
{0 -1,320 1,1070 1,1353 1}
{0 -1,210 1,320 2,477 2,867 1,1020 1,1067 1,1075 1,1212 1,1213 1,1240 1,1373 1,1404 1,1542 1,1599 1,1628 1,1815 1,1847 1,2067 1,2075 1}
{179 1,1815 1}
{298 1,504 1,662 1,713 1,752 1,1163 1,1275 1,1488 1,1787 1,2011 1,2075 1}
{144 1,785 1,1274 1}
{19 1,256 1,390 1,808 1,1314 1,1350 1,1442 1,1464 1,1532 1,1786 1,1823 1,1864 1,1908 1,1924 1}
{84 1,186 1,320 1,459 1,564 1,636 1,673 1,810 1,811 1,966 1,997 1,1094 1,1163 1,1207 1,1592 1,1593 1,1714 1,1836 1,1853 1,1964 1,1984 1,1997 2,2058 1}
{9 1,1173 1,1768 1,1818 1}
{86 1,935 1,1112 1,1337 1,1348 1,1482 1,1549 1,1783 1,1853 1}

As you can see, the first few instances are fine (they contain the -1 class along with the other features), but the remaining instances don't contain the positive class attribute (1).

I mean, there should have been {0 1,...} as the very first entry of the last instances in the output ARFF file, but it is missing.

score 0 · Answer 1 · answered Aug 25 '16 at 18:51

You have to specify your class attribute explicitly in the Java program because, when you apply the StringToWordVector filter, your input gets split into the specified n-grams, so the position of the class attribute changes once StringToWordVector has vectorized the input. You can use the Reorder filter to move the class attribute to the last position; Weka will then pick the last attribute as the class attribute.

More information about reordering in Weka can be found at http://weka.sourceforge.net/doc.stable-3-8/weka/filters/unsupervised/attribute/Reorder.html. Example 5 at http://www.programcreek.com/java-api-examples/index.php?api=weka.filters.unsupervised.attribute.Reorder may also help you with the reordering.

Hope it helps.

score 0 · Answer 2 · answered Nov 23 '19 at 20:17

Your process for obtaining TF-IDF seems correct.
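For reference, here is a minimal plain-Java sketch of what the two transform flags compute. It assumes, from my reading of the StringToWordVector Javadoc, that setTFTransform maps a word count f to log(1 + f) and that setIDFTransform multiplies by log(N / df), where N is the number of documents and df the number of documents containing the word. The class TfIdfSketch, its simple tokenizer, and the three-tweet input are made up for illustration; this is not Weka's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TfIdfSketch {

    // Lower-case the text and split it on runs of non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumed weighting: log(1 + count) * log(N / df) per word and document.
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> c = new TreeMap<>();
            for (String tok : tokenize(doc)) {
                c.merge(tok, 1, Integer::sum);
            }
            // Each distinct word counts once per document for df.
            for (String tok : c.keySet()) {
                docFreq.merge(tok, 1, Integer::sum);
            }
            counts.add(c);
        }
        int n = docs.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> c : counts) {
            Map<String, Double> w = new TreeMap<>();
            for (Map.Entry<String, Integer> e : c.entrySet()) {
                double tf = Math.log(1 + e.getValue());
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                w.put(e.getKey(), tf * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Need a hug", "need sleep, need coffee", "Just woke up");
        System.out.println(tfIdf(tweets));
    }
}
```

Under this weighting, a word that occurs in every document gets weight 0, because log(N / N) = 0.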

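According to my experiments, if you have n classes, Weka writes the class label explicitly for the records of n-1 classes, and the label of the remaining class is implied. This fits how sparse ARFF works: a sparse row lists only the values that are stored as non-zero, and the first label declared for a nominal attribute is stored as index 0.

In your case, you have 2 classes, -1 and 1, declared as {1, -1}. Weka therefore writes the label for records with class -1, and the label 1 is implied for records where it is omitted.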
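To make this concrete, here is a small plain-Java sketch that mimics the sparse row notation from the question. It assumes that a nominal value is stored as its position in the declared label list ({1, -1}, so 1 is stored as 0 and -1 as 1) and that a sparse row omits everything stored as 0. The class SparseRowSketch and its method are hypothetical and only imitate the output format; they are not part of Weka.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SparseRowSketch {

    // Labels in declaration order, as in "@attribute sentiment {1, -1}".
    static final List<String> SENTIMENT_LABELS = Arrays.asList("1", "-1");

    // Render one row in sparse notation: attribute 0 is the class,
    // the remaining entries are word attributes with their counts.
    static String toSparseRow(String label, SortedMap<Integer, Integer> wordCounts) {
        List<String> parts = new ArrayList<>();
        int stored = SENTIMENT_LABELS.indexOf(label); // "1" -> 0, "-1" -> 1
        if (stored != 0) {
            // Only a non-zero stored value is written out.
            parts.add("0 " + label);
        }
        for (Map.Entry<Integer, Integer> e : wordCounts.entrySet()) {
            if (e.getValue() != 0) {
                parts.add(e.getKey() + " " + e.getValue());
            }
        }
        return "{" + String.join(",", parts) + "}";
    }

    public static void main(String[] args) {
        SortedMap<Integer, Integer> negative = new TreeMap<>();
        negative.put(320, 1);
        negative.put(1070, 1);
        negative.put(1353, 1);
        System.out.println(toSparseRow("-1", negative));

        SortedMap<Integer, Integer> positive = new TreeMap<>();
        positive.put(179, 1);
        positive.put(1815, 1);
        System.out.println(toSparseRow("1", positive));
    }
}
```

If you want every label written out explicitly, converting the filtered data to dense form (for example with Weka's SparseToNonSparse filter) before saving should do it. Declaring the attribute as {-1, 1} would only change which class is the implied one.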
