
I'm using Mallet 2.0.7 in Java to mine tweets. According to the documentation, for topic modeling I have to read the data set using CsvIterator.

Reader fileReader = new InputStreamReader(new FileInputStream(new File(args[0])), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        3, 2, 1)); // data, label, name fields

My data set looks like this: row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment

For the label I added column x. For now I want to run the algorithm on the text column (6), and later add another column. I wrote the pattern below, but it doesn't work as expected: it takes everything from column 6 to the end of the line as data. How should I change the regular expression?

Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        6, 2, 1)); // data, label, name fields
Sanath
NASRIN

1 Answer


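Check the regular expression documentation to understand what each element of the pattern means. The original pattern divides the whole line into three groups: the first run of non-whitespace characters, then (after skipping any whitespace or commas) the second run of non-whitespace characters, and then everything else. Note that \S also matches commas, so a field only ends at whitespace; on purely comma-separated input the groups will not line up with the columns.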

The new pattern does the same, but captures six groups. That's why you're getting everything from the text to the end of the line.
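To see this behaviour outside MALLET, here is a minimal sketch that uses only java.util.regex. The class name GroupDemo and the sample line are made up for illustration, and the line is whitespace-separated on purpose so that each \S* run is one field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The three-group pattern from the question.
    static final Pattern THREE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");
    // The six-group pattern from the question.
    static final Pattern SIX = Pattern.compile(
            "^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns the text captured by the given group, or null if the line does not match.
    static String group(Pattern p, String line, int index) {
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Hypothetical whitespace-separated row, invented for illustration.
        String line = "7 x paris alice nlp good talk today 3 2017-10-19";
        // Group 3 of the original pattern: everything after the second field.
        System.out.println(group(THREE, line, 3)); // paris alice nlp good talk today 3 2017-10-19
        // Group 6 of the new pattern: the text plus every column after it.
        System.out.println(group(SIX, line, 6));   // good talk today 3 2017-10-19
    }
}
```

The last group is (.*) in both patterns, so it always runs to the end of the line no matter how many groups come before it.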

I would recommend a few fixes:

  • If a field isn't relevant, like label, you can just use 0 to specify that it doesn't exist. You don't need to add a dummy field.

  • Anything in () is a capturing group. If you don't want to include a field, don't capture it. Just delete the parentheses but leave the pattern.

  • The original pattern works because we can make assumptions about the name and label fields: they don't contain commas or spaces, and everything afterwards is text. To grab a field in the middle of a line, you need to be more careful: you have to find the end of the text field. I would strongly suggest using tab-delimited fields, assuming no field contains tab characters.

Try something like this (not tested):

// row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        // ten tab-separated fields: group 1 = row (name), group 2 = text (column 6)
        Pattern.compile("^(\\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$"),
        2, 0, 1)); // data, label, name fields
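
A pattern like this can be checked without running MALLET. The sketch below is an assumption-laden example, not part of MALLET: it uses only java.util.regex, assumes exactly the ten tab-separated columns listed in the comment above with the text in column 6, and the class name TabPatternCheck and the sample row are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TabPatternCheck {
    // One group for the row number, four skipped fields, one group for the text,
    // then four more skipped fields: ten tab-separated columns in total.
    static final Pattern LINE = Pattern.compile(
            "^(\\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$");

    // Returns {name, data} for a matching line, or null when the line does not match.
    static String[] nameAndData(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical row, invented for illustration only.
        String line = String.join("\t", "7", "x", "paris", "alice", "nlp",
                "good talk today", "3", "2017-10-19", "5", "0");
        String[] fields = nameAndData(line);
        System.out.println(fields[0] + " -> " + fields[1]); // prints "7 -> good talk today"
    }
}
```

A line with a different number of tabs does not match at all, which makes a wrong column count easy to spot before the data goes through the pipe.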
David Mimno
  • Thank you for the answer! I assumed the data-group parameter is the index of the text column I want to detect topics in, which is why I passed 6. You passed 2; does that mean two columns of my data set (text and retweets)? – NASRIN Oct 19 '17 at 14:47
  • In my pre-processing step I removed commas and stop words and stemmed the tweets, so I used a comma as the delimiter and changed the pattern following your guidance: "^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$" But I still have doubts about my program. Do you have any MALLET examples other than those on the main MALLET site? – NASRIN Oct 19 '17 at 15:44
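
For the comma-delimited variant, one way to avoid counting fields by hand is to generate the pattern. This is only a sketch under assumptions: no field contains a comma, fields are separated by exactly one comma, and the helper name build and the sample row are hypothetical rather than anything MALLET provides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaPatternSketch {
    // Builds a pattern for `columns` comma-separated fields that captures only the
    // 1-based columns `first` and `second` (first < second), as groups 1 and 2.
    static Pattern build(int columns, int first, int second) {
        StringBuilder sb = new StringBuilder("^");
        for (int col = 1; col <= columns; col++) {
            if (col > 1) {
                sb.append(','); // exactly one comma between neighbouring fields
            }
            boolean capture = (col == first || col == second);
            sb.append(capture ? "([^,]*)" : "[^,]*");
        }
        return Pattern.compile(sb.append('$').toString());
    }

    public static void main(String[] args) {
        // Ten columns as in the question: capture row (1) and text (6).
        Pattern p = build(10, 1, 6);
        // Hypothetical row, invented for illustration only.
        Matcher m = p.matcher("7,x,paris,alice,nlp,good talk today,3,2017-10-19,5,0");
        if (m.matches()) {
            System.out.println(m.group(1) + " -> " + m.group(2)); // prints "7 -> good talk today"
        }
    }
}
```

With a pattern built this way, the group numbers passed to CsvIterator would be 2 for data, 0 for label and 1 for name, as in the answer. The 2 is the index of the second capturing group, not a count of columns.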