Switch from Guava CharMatcher to Regex

Question

I'm currently using the following CharMatcher algorithm to parse out all @Mentions in the twitter status in a file of 10 million tweets. It appears to be eating up a great deal of memory. Running the Netbeans profiler, it appears to create a LOT of char[] arrays, which I can only assume is from the CharMatcher solution I implemented.

Can anyone either recommend a MUCH more efficient CharMatcher/Strings method OR a regex solution (which I assume would be more efficient in terms of object creation)? Speed is not my primary concern....

@Override
public boolean filter(Tweet msg) {

    List<String> statusList = Splitter.on(CharMatcher.BREAKING_WHITESPACE).trimResults().omitEmptyStrings().splitToList(msg.getStatusText());

    for (int i = 0; i < statusList.size(); i++) {
        if (statusList.get(i).contains("@")) {
            insertTwitterLegalUsernames(statusList.get(i), msg);
        }
    }

    if (msg.hasAtMentions()) {
        Statistics.increaseNumTweetsWithAtMentions();
    }

    statusList = null;
    return msg.hasAtMentions();
}

private void insertTwitterLegalUsernames(String token, Tweet msg) {
    token = token.substring(token.indexOf("@"), token.length());
    List<String> splitList = Splitter.on(CharMatcher.inRange('0', '9').or(CharMatcher.inRange('a', 'z')).or(CharMatcher.inRange('A', 'Z')).or(CharMatcher.anyOf("_@")).negate()).splitToList(token);
    for (int j = 0; j < splitList.size(); j++) {
        if (splitList.get(j).length() > 1 && splitList.get(j).contains("@")) {
            String finalToken = splitList.get(j).substring(splitList.get(j).lastIndexOf("@") + 1, splitList.get(j).length());
            if (!finalToken.equalsIgnoreCase(msg.getUserScreenNameString())) {
                msg.addAtMentions(finalToken);
            }
        }
    }

}

The expected input could be anything with username's throughout it. I want to extract the username which is considered legal beginning with an '@' and followed by any number of number or character 'a' - 'z', 'A' - 'Z', 0-9 and '_', beginning with an '@'.

Should there be any illegal characters immediately following the '@', we would disregard, however we would expect to extract usernames that are either before or after either other legal usernames or illegal characters.

As an example input:

"!@@@Mike,#Java@Nancy_2,this this on for size"

Should return:

Mike

Nancy_2

The answer should be valid for use in Java.

The fact that all this logic, that changes various mutable objects, is in a method called `filter` genuinely scares me. I hope never to encounter the code base this comes from. — Boris the Spider, Mar 26 '15 at 17:21
For better help sooner post examples of expected input and output - before and after Tweets for example. — Boris the Spider, Mar 26 '15 at 17:22
It's a small app to concert some internal files from one format to another with some string manipulation, this isn't production code. The expected input could be anything with username's throughout it. I want to extract the username which is legal with any character 'a' - 'z', 'A' - 'Z', 0-9 and '_', beginning with an '@'. Should there be any illegal characters immediately following the '@', we would disregard, however we would expect to extract usernames that are either before or after either other legal usernames or illegal characters. — Brooks, Mar 26 '15 at 17:23
Be it production code or not, bad habits die hard - and this is a **very** bad habit. I would **never** expect a method that filters a `Collection` to _change_ the elements it filters. — Boris the Spider, Mar 26 '15 at 17:24
Post relevant information in the question, not in comments. A general explanation is not helpful - we have that in the code. We need test cases that we can easily use and verify that any suggestions function correctly. These need to be tests cases that mean code posted in answers functions as **you** would expect. — Boris the Spider, Mar 26 '15 at 17:28
Understood and I appreciate all constructive advice, though that was my design choice and not really the subject of the question. If you have some help or advice to offer on what I asked, i'm really interested in it. — Brooks, Mar 26 '15 at 17:34
I have suggested what you need to do to get useful advice from this forum. I also note that you use by index access on your `List`s. How do you know that they implement `RandomAccess`? Why do you use by index access rather than the more modern `foreach` loop? If these `List` implementations happen to be a, for example, `LinkedList` then each call to `get` will be `O(n)` - i.e. **very** expensive. — Boris the Spider, Mar 26 '15 at 17:35
CharMatcher is seriously unlikely to incur more allocation than regex. The other way around is much more likely. — Louis Wasserman, Mar 27 '15 at 02:21

score 5 · Accepted Answer · edited May 23 '17 at 12:09

From you explanation:

The expected input could be anything with username's throughout it. I want to extract the username which is legal with any character 'a' - 'z', 'A' - 'Z', 0-9 and '_', beginning with an '@'. Should there be any illegal characters immediately following the '@', we would disregard, however we would expect to extract usernames that are either before or after either other legal usernames or illegal characters

It seems we are searching for [\w] (which is shorthand for [a-zA-Z0-9_]) that is immediately preceded by an @. This is very simply in Regex, the main worry is to eliminate backtracking and the cost of almost-matches.

The pattern:

(?<=@)[\w]++

Will do exactly what you ask.

Breaking the pattern down:

(?<=@) is a positive lookbehind assertion, to check that an @ precedes this match
[\w]++ possessively matches the name itself, it must contain at least one character.

First, declare the Pattern globally. It's thread safe and should be reused.

private static final Pattern TWITTER_NAME = Pattern.compile("(?<=@)[\\w]++")

You can then use a method such as this one to extract (unique) usernames:

public static Set<String> findNames(final String input) {
    final Matcher matcher = TWITTER_NAME.matcher(input);
    final Set<String> names = new HashSet<>();
    while (matcher.find()) {
        names.add(matcher.group());
    }
    return names;
}

Note that you can also reuse the Matcher with reset(String), but a Matcher is not threadsafe - you could consider using ThreadLocal matcher instances to boost performance if necessary. If not using multiple threads, then you can use a global Matcher too.

Testing with your input:

public static void main(final String[] args) throws Exception {
    System.out.println(findNames("!@@@Mike,#Java@Nancy_2,this this on for size"));
}

Yields:

[Mike, Nancy_2]

As a side note, you are looping by-index over all your Lists. This is a very bad idea - especially as you have no idea what type of List Splitter.splitToList returns. If it happens to be a LinkedList then access by-index is O(n) so whilst this loop:

for(final String s : myList) {
    System.out.println(s);
}

is obviously O(n), the same loop by index:

for(int i = 0; i < myList.size(); ++i) {
    System.out.println(myList.get(i));
}

could easily be O(n^2). This is a massive performance penalty for absolutely no reason.

TL;DR: Never use a by-index loop unless you:

know that your List is RandomAccess; and
really need the index for some reason.

Further addendum, if you want to be Java 8-y, you could use the following code to wrap a Matcher in a Spliterator:

public class MatcherSpliterator extends AbstractSpliterator<MatchResult> {

    private final Matcher m;

    public MatcherSpliterator(final Matcher m) {
        super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
        this.m = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if (!m.find()) {
            return false;
        }
        action.accept(m.toMatchResult());
        return true;
    }
}

And then a simple method to return match results in a Stream:

public static Stream<MatchResult> extractMatches(final Pattern pattern, final String input) {
    return StreamSupport.stream(new MatcherSpliterator(pattern.matcher(input)), false);
}

And now your method becomes:

public static Set<String> findNames(final String input) {
    return extractMatches(TWITTER_NAME, input)
            .map(MatchResult::group)
            .collect(toSet());        
}

Inspiration from this SO answer

Thank you, i'm still having OutOfMemory issues for an unknown reason, but it answered the mail. I was originally using the enhanced for loop, but switched to index because I was trying to cut down on object creation, rather than worry about speed. This was prior to using Profile mode to see where the issue lied. — Brooks, Mar 26 '15 at 18:38
@Brooks looking at the logic in your code - you are parsing the entire Tweet for each name found in the title. This is obviously rather inefficient, consider parsing the title to a `Set`, then the Tweet to a `Set` then using [`Set.retainlAll`](http://docs.oracle.com/javase/7/docs/api/java/util/Set.html#retainAll(java.util.Collection)) to find the union. — Boris the Spider, Mar 27 '15 at 09:10
@Brooks what on earth makes you think that a `foreach` loops creates object?! Java is pass reference by value! — Boris the Spider, Mar 27 '15 at 10:44
@BoristheSpider well the `foreach` loop does create the `Iterator` which might be a concern in the inner loop in `insertTwitterLegalUsernames`, except the method actually recreates the (immutable) `Splitter` and `CharMatcher` every time, which probably cost more :) So it's really an premature optimization not based on any measurement. — Frank Pavageau, Mar 27 '15 at 11:18

Switch from Guava CharMatcher to Regex

1 Answers1