Collect HashSet / Java 8 / Regex Pattern / Stream API

Question

I recently switched my project from JDK 7 to JDK 8, and now I am rewriting some code snippets using the new features that came with Java 8.

final Matcher mtr = Pattern.compile(regex).matcher(input);

HashSet<String> set = new HashSet<String>() {{
    while (mtr.find()) add(mtr.group().toLowerCase());
}};

How can I write this code using the Stream API?

Marko Topolnik · Accepted Answer · 2014-07-09T20:49:42.043

A Matcher-based spliterator implementation can be quite simple if you reuse the JDK-provided Spliterators.AbstractSpliterator:

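import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;

public class MatcherSpliterator extends AbstractSpliterator<String[]>
{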
  private final Matcher m;

  public MatcherSpliterator(Matcher m) {
    super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
    this.m = m;
  }

  @Override public boolean tryAdvance(Consumer<? super String[]> action) {
    if (!m.find()) return false;
    final String[] groups = new String[m.groupCount()+1];
    for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
    action.accept(groups);
    return true;
  }
}

Note that the spliterator provides all matcher groups, not just the full match. Also note that this spliterator supports parallelism because AbstractSpliterator implements a splitting policy.

Typically you will use a convenience stream factory:

public static Stream<String[]> matcherStream(Matcher m) {
  return StreamSupport.stream(new MatcherSpliterator(m), false);
}

This gives you a powerful basis to concisely write all kinds of complex regex-oriented logic, for example:

private static final Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");
public static void main(String[] args) {
  final String emails = "kid@gmail.com, stray@yahoo.com, miks@tijuana.com";
  System.out.println("User has e-mail accounts on these domains: " +
      matcherStream(emailRegex.matcher(emails))
      .map(gs->gs[2])
      .collect(joining(", ")));
}

Which prints

User has e-mail accounts on these domains: gmail.com, yahoo.com, tijuana.com

For completeness, your code will be rewritten as

Set<String> set = matcherStream(mtr).map(gs->gs[0].toLowerCase()).collect(toSet());
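
For readers who want to try this, here is a minimal self-contained sketch that assembles the pieces above into one compilable unit. The wrapper class `MatcherStreams`, the helper `lowerCaseMatches`, and the sample regex and input are assumptions added for illustration; only `MatcherSpliterator` and `matcherStream` come from the answer.

```java
import java.util.Set;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical wrapper class, used only to keep the sketch in one file.
final class MatcherStreams {

  static final class MatcherSpliterator extends AbstractSpliterator<String[]> {
    private final Matcher m;

    MatcherSpliterator(Matcher m) {
      // The number of matches is unknown up front, so report Long.MAX_VALUE.
      super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
      this.m = m;
    }

    @Override public boolean tryAdvance(Consumer<? super String[]> action) {
      if (!m.find()) return false;
      // Index 0 holds the full match, indexes 1..groupCount() the capture groups.
      final String[] groups = new String[m.groupCount() + 1];
      for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
      action.accept(groups);
      return true;
    }
  }

  static Stream<String[]> matcherStream(Matcher m) {
    return StreamSupport.stream(new MatcherSpliterator(m), false);
  }

  // Hypothetical helper: the loop from the question as a stream pipeline.
  static Set<String> lowerCaseMatches(String regex, String input) {
    return matcherStream(Pattern.compile(regex).matcher(input))
        .map(gs -> gs[0].toLowerCase())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    // Sample regex and input are made up for the demonstration.
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo bar FOO Baz"));
  }
}
```

The result is a plain `Set` from `Collectors.toSet()`, so the iteration order of the printed elements is unspecified.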

Nicely done! It would be good if something like this were migrated into the `Matcher` API itself. — Stuart Marks, Jul 10 '14 at 07:34
Note: an API `Matcher.results()` returning `Stream` has been integrated into JDK 9: https://bugs.openjdk.java.net/browse/JDK-8071479 — Stuart Marks, Mar 05 '15 at 05:15
@StuartMarks +1 and I'm especially looking forward to `replaceAll(Function)`, which is a substitute for possibly the ugliest boilerplate idiom left over from early versions: `StringBuffer b = ...; while (m.find()) { ... m.appendReplacement(b, ...); } m.appendTail(b);` — Marko Topolnik, Mar 05 '15 at 09:05
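
As a rough illustration of the two JDK 9 additions mentioned in these comments, the sketch below uses `Matcher.results()` and `Matcher.replaceAll(Function)`. It assumes Java 9 or later; the class name `Jdk9Regex`, the method names, and the sample patterns are made up for the example.

```java
import java.util.Set;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Requires Java 9 or later. Class and method names are illustrative only.
final class Jdk9Regex {

  // The loop from the question, using the built-in stream of matches.
  static Set<String> lowerCaseMatches(String regex, String input) {
    return Pattern.compile(regex).matcher(input)
        .results()                  // Stream<MatchResult>, one element per match
        .map(MatchResult::group)    // the full match, same as group(0)
        .map(String::toLowerCase)
        .collect(Collectors.toSet());
  }

  // Replaces every match with its upper-cased form, without the
  // appendReplacement/appendTail boilerplate.
  static String upperCaseMatches(String regex, String input) {
    return Pattern.compile(regex).matcher(input)
        // The function's result is interpreted as a replacement string, so
        // quote it in case the match contains '$' or '\'.
        .replaceAll(mr -> Matcher.quoteReplacement(mr.group().toUpperCase()));
  }

  public static void main(String[] args) {
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo bar FOO Baz"));
    System.out.println(upperCaseMatches("[a-z]+@", "kid@gmail.com, stray@yahoo.com"));
  }
}
```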

score 9 · Answer 2 · edited May 23 '17 at 12:01

Marko's answer demonstrates how to get matches into a stream using a Spliterator. Well done, give that man a big +1! Seriously, make sure you upvote his answer before you even consider upvoting this one, since this one is entirely derivative of his.

I have only a small bit to add to Marko's answer, which is that instead of representing the matches as an array of strings (with each array element representing a match group), the matches are better represented as a MatchResult which is a type invented for this purpose. Thus the result would be a Stream<MatchResult> instead of Stream<String[]>. The code gets a little simpler, too. The tryAdvance code would be

    if (m.find()) {
        action.accept(m.toMatchResult());
        return true;
    } else {
        return false;
    }

The map call in his email-matching example would change to

    .map(mr -> mr.group(2))

and the OP's example would be rewritten as

Set<String> set = matcherStream(mtr)
                      .map(mr -> mr.group(0).toLowerCase())
                      .collect(toSet());

Using MatchResult gives a bit more flexibility in that it also provides offsets of match groups within the string, which could be useful for certain applications.
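
A small sketch of what the offsets make possible, assuming the `MatchResult`-based variant described above. The class `MatchResultStreams` and the helper `domainsWithOffsets` are hypothetical names; the spliterator is written as an anonymous class only to keep the example short.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical class name; a sketch of the MatchResult-based variant.
final class MatchResultStreams {

  static Stream<MatchResult> matcherStream(Matcher m) {
    return StreamSupport.stream(
        new AbstractSpliterator<MatchResult>(
            Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
          @Override public boolean tryAdvance(Consumer<? super MatchResult> action) {
            if (!m.find()) return false;
            // toMatchResult() takes a snapshot that later find() calls do not affect.
            action.accept(m.toMatchResult());
            return true;
          }
        }, false);
  }

  // Hypothetical helper: each e-mail domain with the half-open offset range
  // [start, end) of capture group 2 within the input.
  static List<String> domainsWithOffsets(String input) {
    Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");
    return matcherStream(emailRegex.matcher(input))
        .map(mr -> mr.group(2) + "[" + mr.start(2) + "," + mr.end(2) + ")")
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(domainsWithOffsets("kid@gmail.com, stray@yahoo.com"));
    // prints [gmail.com[4,13), yahoo.com[21,30)]
  }
}
```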

+1 and thanks for the praise :) I never took note of the `MatchResult` part of `Matcher` API, it's definitely the way to go. — Marko Topolnik, Jul 10 '14 at 08:26
Note: an API `Matcher.results()` returning `Stream` has been integrated into JDK 9: https://bugs.openjdk.java.net/browse/JDK-8071479 — Stuart Marks, Mar 05 '15 at 05:15

score 8 · Answer 3 · answered Jul 09 '14 at 18:17

I don't think you can turn this into a Stream without writing your own Spliterator, but I don't know why you would want to.

Matcher.find() is a state-changing operation on the Matcher object, so running each find() in a parallel stream would produce inconsistent results. Running the stream serially wouldn't perform better than the Java 7 equivalent and would be harder to understand.

dkatzel

I don't want to write a loop for it; it would be better to write it inline. Regarding performance, in this case the `Matcher` will have a small number of groups. – Anton Dozortsev Jul 09 '14 at 19:02
If you want a one-liner, why don't you create a method for it and then use it like this: `Set groups = MatcherUtil.groupsOfPattern(pattern,input);` – Panu Jul 09 '14 at 19:03
I would argue that a Matcher-based Stream would be a very welcome feature. For example, Clojure offers `re-seq` as the primary primitive to use for regex processing. – Marko Topolnik Jul 09 '14 at 19:31
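
The helper suggested in the comment above might look roughly like the following. The name `MatcherUtil.groupsOfPattern` is taken from that comment, but the body is an assumption: it simply wraps the loop from the question, including the lower-casing, in a static method.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the utility suggested in the comment; the body is assumed.
final class MatcherUtil {
  private MatcherUtil() {}

  // Collects every full match of the pattern in the input, lower-cased.
  static Set<String> groupsOfPattern(Pattern pattern, String input) {
    Set<String> result = new HashSet<>();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
      result.add(m.group().toLowerCase());
    }
    return result;
  }
}
```

Call sites then stay on one line, for example `Set<String> groups = MatcherUtil.groupsOfPattern(pattern, input);`, without the anonymous `HashSet` subclass from the question.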

score 3 · Answer 4 · answered Jul 10 '14 at 07:33

What about Pattern.splitAsStream?

Stream<String> stream = Pattern.compile(regex).splitAsStream(input);

and then a collector to get a set.

Set<String> set = stream.map(String::toLowerCase).collect(Collectors.toSet());

gontard

Useful in some cases, but doesn't match the OP's question. The OP has a pattern that matches stuff he wants to put into the set. `splitAsStream` matches delimiters, stuff **between** the values that end up in the stream and eventually into the destination set. – Stuart Marks Jul 10 '14 at 08:42
You are right, but the OP may adapt his code. Anyway, it is interesting to be aware of this alternative, especially because the other "Java 8 style" answers propose much uglier code than the Java 7 one. – gontard Jul 10 '14 at 08:53
Remember that beauty is in the eye of the beholder. Many will agree there are few idioms in Java uglier than creating an anonymous HashSet subclass just to save a line of code. Also, proposing an API extension should not be equalled with proposing ugly client code. Finally, twisting a regex solution from positive to negative match just to avoid writing such an extension can easily result in code which is not just ugly, but incorrect. – Marko Topolnik Jul 10 '14 at 17:15
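
To make the positive versus negative match distinction in these comments concrete, here is a hedged sketch that extracts the same words both ways. The class name, the method names, and both patterns are assumptions chosen for the example; with less trivial patterns, inverting the regex is not always this easy.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative names and patterns; not taken from the thread.
final class SplitVersusMatch {

  // splitAsStream: the pattern describes the delimiters BETWEEN the values.
  static Set<String> bySplitting(String input) {
    return Pattern.compile("[^A-Za-z]+").splitAsStream(input)
        .filter(s -> !s.isEmpty())  // guard against an empty leading element
        .map(String::toLowerCase)
        .collect(Collectors.toSet());
  }

  // find loop: the pattern describes the values themselves.
  static Set<String> byMatching(String input) {
    Set<String> set = new HashSet<>();
    Matcher m = Pattern.compile("[A-Za-z]+").matcher(input);
    while (m.find()) {
      set.add(m.group().toLowerCase());
    }
    return set;
  }

  public static void main(String[] args) {
    String input = ", Foo bar; FOO";
    System.out.println(bySplitting(input).equals(byMatching(input))); // prints true
  }
}
```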

score 1 · Answer 5 · answered Apr 02 '16 at 17:44

What about

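import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;

public class MakeItSimple {

  public static void main(String[] args) throws FileNotFoundException {
    // try-with-resources closes the Scanner even if reading fails
    try (Scanner s = new Scanner(new File("C:\\Users\\Admin\\Desktop\\TextFiles\\Emails.txt"))) {
      HashSet<String> set = new HashSet<>();
      while (s.hasNext()) {
        String r = s.next(); // next whitespace-delimited token
        if (r.matches("([^,]+?)@([^,]+)")) {
          set.add(r);
        }
      }
      set.stream().map(String::toUpperCase).forEach(System.out::println);
    }
  }
}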

score 0 · Answer 6 · answered Jul 09 '14 at 19:02

Here is an implementation using the Spliterator interface.

    // To get the required set
    Set<String> result = StreamSupport.stream(new MatcherGroupIterator(pattern, input), false)
            .map(s -> s.toLowerCase())
            .collect(Collectors.toSet());
    ...
    private static class MatcherGroupIterator implements Spliterator<String> {
      private final Matcher matcher;

      public MatcherGroupIterator(Pattern p, String s) {
        matcher = p.matcher(s);
      }

      @Override
      public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()){
            return false;
        }
        action.accept(matcher.group());
        return true;
      }

      @Override
      public Spliterator<String> trySplit() {
        return null;
      }

      @Override
      public long estimateSize() {
        return Long.MAX_VALUE;
      }

      @Override
      public int characteristics() {
        return Spliterator.NONNULL;
      }
    }

Panu

I think this code example shows why it's not worth converting your Java 7 code into a Java 8 Stream. Even if you say "well, I'll only have to write this Spliterator once", your actual stream code is just as verbose as your Java 7 version PLUS you now have an additional class to maintain – dkatzel Jul 09 '14 at 19:08
This could have been quite a bit shorter---*and* supported parallelism---if it was based on `AbstractSpliterator`. Another thing: each match should not be just a string, but a list of all available matcher *groups*. – Marko Topolnik Jul 09 '14 at 20:10
@dkatzel If one's only use case is as simple as the OP's, then writing a whole Spliterator for it isn't much of a benefit. However, having the regex match results available in the form of a Stream supports a much wider range of much more complex use cases. – Marko Topolnik Jul 09 '14 at 20:13
@MarkoTopolnik I don't think parallelism can be supported since the Match state is mutable and is updated each time `find()` is called. The javadoc for Matcher explicitly states that the "instances are NOT threadsafe for use by multiple concurrent threads" – dkatzel Jul 09 '14 at 20:16
@dkatzel `spliterators are not expected to be thread-safe; instead, implementations of parallel algorithms using spliterators should ensure that the spliterator is only used by one thread at a time.` (from Spliterator javadoc) – Marko Topolnik Jul 09 '14 at 20:24
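
The point of this exchange can be checked with a small experiment. The sketch below (class and method names are hypothetical) builds an `AbstractSpliterator` over a single `Matcher` and runs the same pipeline sequentially and in parallel. It relies on the behaviour quoted from the Spliterator javadoc: the stream framework only ever drives the source spliterator from one thread at a time, while the batches it splits off are processed concurrently.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical demo class; a sketch, not a benchmark.
final class ParallelMatcherDemo {

  static Stream<String> matches(Matcher m, boolean parallel) {
    return StreamSupport.stream(
        new AbstractSpliterator<String>(
            Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
          @Override public boolean tryAdvance(Consumer<? super String> action) {
            // find() mutates the Matcher, but only the thread currently
            // holding this spliterator ever calls it.
            if (!m.find()) return false;
            action.accept(m.group());
            return true;
          }
        }, parallel);
  }

  static List<String> lowerCased(String regex, String input, boolean parallel) {
    return matches(Pattern.compile(regex).matcher(input), parallel)
        .map(String::toLowerCase)   // this stage may run on several threads
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 5000; i++) sb.append("Word").append(i).append(' ');
    List<String> sequential = lowerCased("[A-Za-z0-9]+", sb.toString(), false);
    List<String> parallel = lowerCased("[A-Za-z0-9]+", sb.toString(), true);
    System.out.println(sequential.equals(parallel)); // prints true
  }
}
```

Matching itself still happens one match at a time, so any speed-up can only come from the downstream stages; whether that pays off for a given workload would need measuring.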
