25

So, how would you go about converting

String csv = "11,00,33,66,44,33,22,00,11";

to a HashSet in the quickest, most optimized way?

This is for a list of user-ids.

Update

I ran all the answers provided through a test program in which each method was called 500,000 times on a bigger CSV string. The test was performed 5 times in a row (in case program startup slowed the first run), and I got the following timings in milliseconds (ms):

Method One Liner->  6597
Method Split&Iterate->  6090
Method Tokenizer->  4306
------------------------------------------------
Method One Liner->  6321
Method Split&Iterate->  6012
Method Tokenizer->  4227
------------------------------------------------
Method One Liner->  6375
Method Split&Iterate->  5986
Method Tokenizer->  4340
------------------------------------------------
Method One Liner->  6283
Method Split&Iterate->  5974
Method Tokenizer->  4302
------------------------------------------------
Method One Liner->  6343
Method Split&Iterate->  5920
Method Tokenizer->  4227
------------------------------------------------


static void method0_oneLiner() {
    for (int j = 0; j < TEST_TIMES; j++) {
        Set<String> hashSet = new HashSet<String>(Arrays.asList(csv.split(",")));
    }
}

static void method1_splitAndIterate() {
    for (int j = 0; j < TEST_TIMES; j++) {
        String[] values = csv.split(",");
        HashSet<String> hSet = new HashSet<String>(values.length);
        for (int i = 0; i < values.length; i++)
            hSet.add(values[i]);
    }
}

static void method2_tokenizer() {
    for (int j = 0; j < TEST_TIMES; j++) {
        HashSet<String> hSet = new HashSet<String>();
        StringTokenizer st = new StringTokenizer(csv, ",");
        while (st.hasMoreTokens())
            hSet.add(st.nextToken());
    }
}
Menelaos
  • How many of those numbers do you have, or how have you determined that this particular code needs to be "quickest-most optimized"? – Kayaman Sep 25 '13 at 11:05
  • I'm writing an analysis algorithm, and because I'm working with a dataset (noSQL DB :( ) that is giant, we are separating the dataset to smaller sets and then converting to hashsets in memory for a specific problem. I profiled this and it does eat up minutes each time so I'd like to have the fastest available option that doesn't involve writing it in C, or converting the data in the nosql db. I actually don't have access to the data. – Menelaos Sep 25 '13 at 11:10
  • See my provided answer for a slightly optimized version. It'll be hard to top that, except maybe using a StreamTokenizer (if you can get the data as a stream from the DB). – Kayaman Sep 25 '13 at 11:24

10 Answers

34
String[] values = csv.split(",");
Set<String> hashSet = new HashSet<String>(Arrays.asList(values));
TheKojuEffect
15

The 6 other answers are great, in that they're the most straight-forward way of converting.

However, since String.split() involves regexps, and Arrays.asList is doing redundant conversion, you might want to do it this way, which may improve performance somewhat.

Edit: If you have a general idea of how many items you will have, pass that number to the HashSet constructor to avoid unnecessary resizing and rehashing:

HashSet<String> myHashSet = new HashSet<String>(500000);  // Or a more realistic size
StringTokenizer st = new StringTokenizer(csv, ",");
while (st.hasMoreTokens())
    myHashSet.add(st.nextToken());
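For anyone who wants to avoid both regular expressions and the legacy StringTokenizer, here is a minimal sketch of the same idea using String.indexOf. The class name CsvToSet, the method toHashSet, and the expectedItems parameter are made up for illustration and are not part of the answer above. The sketch assumes the default HashSet load factor of 0.75 when sizing the set, and it skips empty tokens the way StringTokenizer does. I have not benchmarked it against the methods in the question.

```java
import java.util.HashSet;
import java.util.Set;

class CsvToSet {

    // Hypothetical helper: splits on ',' with indexOf instead of regex or StringTokenizer.
    public static Set<String> toHashSet(String csv, int expectedItems) {
        // A HashSet grows once its size exceeds capacity * 0.75 (default load factor),
        // so request slightly more than expectedItems / 0.75 to avoid a resize.
        Set<String> result = new HashSet<>((int) (expectedItems / 0.75f) + 1);
        int start = 0;
        int comma;
        while ((comma = csv.indexOf(',', start)) >= 0) {
            if (comma > start) { // skip empty tokens, as StringTokenizer does
                result.add(csv.substring(start, comma));
            }
            start = comma + 1;
        }
        if (start < csv.length()) { // last token has no trailing comma
            result.add(csv.substring(start));
        }
        return result;
    }

    public static void main(String[] args) {
        String csv = "11,00,33,66,44,33,22,00,11";
        System.out.println(toHashSet(csv, 9));
    }
}
```

If the number of items is unknown, an estimate such as csv.length() divided by the typical ID length plus one could be passed as expectedItems; that heuristic is also an assumption, not something measured here.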
Kayaman
  • Yes, this is consistently the fastest solution. This is true even when the number of CSV elements does not exceed the HashSet's initial capacity. – Menelaos Sep 25 '13 at 11:33
  • 1
    As SagarG pointed out, `StringTokenizer` usage is now discouraged, as it's a legacy class. The documentation recommends using the `java.util.regex` package instead (http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html). – Pietro Saccardi Nov 08 '15 at 23:35
  • 4
    @PietroSaccardi I appreciate your answer, and I wouldn't necessarily start using `StringTokenizer` in new code, however using regular expressions is slow, which was clear both in the question and in my answer. I suspect that `Scanner` may be used to avoid both the legacy and slow aspects. – Kayaman Nov 16 '15 at 13:36
10
Arrays.stream(csv.split(",")).collect(Collectors.toSet());
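One thing worth knowing about the one-liner above: Collectors.toSet() makes no guarantee about the type or mutability of the set it returns. If the result specifically has to be a HashSet, as the question asks, Collectors.toCollection(HashSet::new) can be used instead. A minimal sketch follows; the class name StreamCsvToSet and the method toHashSet are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.stream.Collectors;

class StreamCsvToSet {

    // Hypothetical helper: same stream pipeline, but the collector names the target type.
    public static HashSet<String> toHashSet(String csv) {
        return Arrays.stream(csv.split(","))
                .collect(Collectors.toCollection(HashSet::new));
    }

    public static void main(String[] args) {
        String csv = "11,00,33,66,44,33,22,00,11";
        HashSet<String> ids = toHashSet(csv);
        ids.add("99"); // safe: the result is a mutable HashSet
        System.out.println(ids);
    }
}
```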
lkogs
dripp
5

You can try

Set<String> set= new HashSet<String>(Arrays.asList(yourString.split(",")));
Suresh Atta
3

Try this:

Set<String> hashSet = new HashSet<>(Arrays.asList(csv.split(",")));

Be careful, though: this may be the easiest way to do it, but it is not necessarily the fastest.

aUserHimself
1
String[] array= csv.split(",");

Set<String> set = new HashSet<String>(Arrays.asList(array));
Prabhakaran Ramaswamy
1

The current accepted answer by @Kayaman is good but I have something to add from the Java API webpage. I was unable to add this as a comment to the answer because of not having enough reputation.

Use of StringTokenizer is discouraged. It is mentioned on the Java API webpage here http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
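For completeness, a sketch of what the java.util.regex route could look like: compile the delimiter pattern once and reuse it, so the pattern is not rebuilt on every call. The class name RegexCsvToSet, the constant COMMA, and the method toHashSet are names invented for this example. Whether this beats csv.split(",") is not measured here; as far as I know, OpenJDK's String.split already takes a non-regex fast path for a plain single-character delimiter, so a precompiled Pattern would mainly pay off for more complex delimiters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class RegexCsvToSet {

    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: split with the precompiled pattern, then copy into a HashSet.
    public static Set<String> toHashSet(String csv) {
        return new HashSet<>(Arrays.asList(COMMA.split(csv)));
    }

    public static void main(String[] args) {
        String csv = "11,00,33,66,44,33,22,00,11";
        System.out.println(toHashSet(csv));
    }
}
```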
SagarG
  • This should rather be an edit to the accepted answer, as it's not an answer on its own. – Pietro Saccardi Nov 08 '15 at 19:13
  • Dude Pietro. I tried to add this as a comment to the original answer first, but the system said I did not have enough rep to comment. Then I tried to edit the answer, but my edit was rejected by a 'Peer' stating that this should go as a comment and not as an edit. Ultimately, this was the only way to post my thoughts. – SagarG Nov 08 '15 at 23:11
  • so much for the sake of knowledge :D I can post it as a comment if you want, and end your quest :D – Pietro Saccardi Nov 08 '15 at 23:16
  • Lol. Please go ahead. – SagarG Nov 08 '15 at 23:17
  • If you're acquainted with the `java.util.regex` package that you mentioned, I suggest you add a solution using it to your answer, to make it useful for other readers. – Pietro Saccardi Nov 08 '15 at 23:36
  • @SagarG I appreciate your answer, and I wouldn't necessarily start using StringTokenizer in new code, however using regular expressions is slow, which was clear both in the question and in my answer. I suspect that Scanner may be used to avoid both the legacy and slow aspects. – Kayaman Nov 16 '15 at 13:36
0

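Try:

String[] splitValues = csv.split(",");
Set<String> set = new HashSet<String>(Arrays.asList(splitValues));

Alternatively, use CollectionUtils.addAll (presumably the one from Apache Commons Collections) to add the array to an existing set:

CollectionUtils.addAll(set, splitValues);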
newuser
0

try

String[] args = csv.split(",");
Set<String> set = new HashSet<String>(Arrays.asList(args));
sunysen
0

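With newer Java versions (9 and later):

import java.util.Set;
Set<String> set = Set.of(csv.split(","));

Note that Set.of returns an unmodifiable set rather than a HashSet, and it throws IllegalArgumentException if the array contains duplicates. The example string in the question does contain duplicates, so this only works when the IDs are known to be unique.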
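Since Set.of rejects duplicate elements with an IllegalArgumentException, here is a sketch of an alternative that tolerates them: Set.copyOf (Java 10+), which keeps one element from each group of duplicates. The class name ImmutableCsvSet and both method names are hypothetical. Like Set.of, the result is unmodifiable, so this only fits if the set is read-only after creation.

```java
import java.util.Arrays;
import java.util.Set;

class ImmutableCsvSet {

    // Hypothetical helper: Set.copyOf (Java 10+) drops duplicates instead of throwing.
    public static Set<String> toImmutableSet(String csv) {
        return Set.copyOf(Arrays.asList(csv.split(",")));
    }

    // Shows the failure mode of Set.of: returns true if it rejected the input.
    public static boolean setOfRejects(String csv) {
        try {
            Set<String> ignored = Set.of(csv.split(","));
            return false;
        } catch (IllegalArgumentException duplicateElement) {
            return true;
        }
    }

    public static void main(String[] args) {
        String csv = "11,00,33,66,44,33,22,00,11";
        System.out.println("Set.of rejected input: " + setOfRejects(csv));
        System.out.println("Distinct ids: " + toImmutableSet(csv).size());
    }
}
```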