I'm currently using the following CharMatcher-based code to extract all @mentions from the status text of each tweet in a file of 10 million tweets. It appears to use a great deal of memory. The NetBeans profiler shows that it creates a LOT of char[] arrays, which I can only assume come from the CharMatcher solution I implemented.
Can anyone recommend either a MUCH more efficient CharMatcher/Strings method or a regex solution (which I assume would be more efficient in terms of object creation)? Speed is not my primary concern.
    @Override
    public boolean filter(Tweet msg) {
        List<String> statusList = Splitter.on(CharMatcher.BREAKING_WHITESPACE)
                .trimResults()
                .omitEmptyStrings()
                .splitToList(msg.getStatusText());
        for (int i = 0; i < statusList.size(); i++) {
            if (statusList.get(i).contains("@")) {
                insertTwitterLegalUsernames(statusList.get(i), msg);
            }
        }
        if (msg.hasAtMentions()) {
            Statistics.increaseNumTweetsWithAtMentions();
        }
        statusList = null;
        return msg.hasAtMentions();
    }

    private void insertTwitterLegalUsernames(String token, Tweet msg) {
        token = token.substring(token.indexOf("@"), token.length());
        List<String> splitList = Splitter.on(
                CharMatcher.inRange('0', '9')
                        .or(CharMatcher.inRange('a', 'z'))
                        .or(CharMatcher.inRange('A', 'Z'))
                        .or(CharMatcher.anyOf("_@"))
                        .negate())
                .splitToList(token);
        for (int j = 0; j < splitList.size(); j++) {
            if (splitList.get(j).length() > 1 && splitList.get(j).contains("@")) {
                String finalToken = splitList.get(j).substring(
                        splitList.get(j).lastIndexOf("@") + 1,
                        splitList.get(j).length());
                if (!finalToken.equalsIgnoreCase(msg.getUserScreenNameString())) {
                    msg.addAtMentions(finalToken);
                }
            }
        }
    }
The input can be arbitrary text with usernames scattered throughout it. A legal username begins with an '@' followed by one or more of the characters 'a'-'z', 'A'-'Z', '0'-'9' and '_'. I want to extract the part that follows the '@'.
If an illegal character immediately follows an '@', that '@' should be disregarded. Usernames should still be extracted when they appear directly before or after other legal usernames or illegal characters.
As an example input:
"!@@@Mike,#Java@Nancy_2,this this on for size"
Should return:
Mike
Nancy_2
The answer should be valid for use in Java.
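For reference, the rule and example above can be written down as a regular expression. This is only a minimal sketch of the intended behaviour, not a measured solution: the class name `MentionExtractor` and the method `extractMentions` are hypothetical, it uses only `java.util.regex` from the standard library, and it leaves out the check against the tweet author's own screen name that the code above performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class MentionExtractor {

    // '@' followed by one or more legal username characters.
    // Compiled once so the pattern is not rebuilt for every tweet.
    private static final Pattern MENTION = Pattern.compile("@([A-Za-z0-9_]+)");

    private MentionExtractor() {
    }

    // Returns the usernames (without the leading '@') in order of appearance.
    static List<String> extractMentions(String statusText) {
        List<String> mentions = new ArrayList<>();
        Matcher matcher = MENTION.matcher(statusText);
        while (matcher.find()) {
            // Group 1 is the username part after the '@'.
            mentions.add(matcher.group(1));
        }
        return mentions;
    }
}
```

Calling `MentionExtractor.extractMentions("!@@@Mike,#Java@Nancy_2,this this on for size")` yields `Mike` and `Nancy_2`: an '@' that is not immediately followed by a legal character simply fails to match and is skipped. Unlike the code above, this sketch would also return both names for input such as `@foo@bar`.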