Find multiple occurrences of words in a string and store the respective starting indices

Question

BACKGROUND

I have a string of text and a hash set that contains the words that I am looking for.

Given

String doc = "one of the car and bike and one of those";
String [] testDoc = doc.split("\\s+");
HashSet<String> setW = new HashSet<>();
setW.add("and");
setW.add("of");
setW.add("one");

OBJECTIVE

The objective is to scan the string and, each time we come across a word that is in the hash set, store the word and its starting index in the original string.

In the above case we should be able to store the following:

one-->0
of-->4
and-->15
and-->24
one-->28
of-->32

ATTEMPT

//create hashmap
for (int i = 0; i < testDoc.length; i++) {
    if (setW.contains(testDoc[i])) {
        doc.indexOf(testDoc[i]); // always returns the first occurrence
        //add string and its index to hashmap
    }
}

That is what I have thought of so far. The only problem is that the indexOf method only finds the first occurrence of a word, so I am not sure what to do. If I keep trimming the string after each word scanned, then I will not get the index position of a word in the original string.

I would love some input here.

score 3 · Accepted Answer · answered Jun 05 '19 at 23:19

There is an overloaded version of indexOf() which takes an index to start the search at. You can use this to repeatedly search for the same string until you reach the end.

Note that you can remove your test for contains() so that you don't search the string twice.
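A minimal sketch of this idea, iterating over the words in `setW` and calling `indexOf(String, int)` repeatedly. The class name `IndexOfSearch`, the method `findAll`, and the `Map<String, List<Integer>>` result type are illustrative choices, not part of the question. Because `indexOf` matches substrings (it would find "one" inside "someone"), the sketch also checks that each match is delimited by spaces or by the ends of the string; that check is an added assumption about what counts as a word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IndexOfSearch {
    // Maps each word to every index at which it occurs in doc as a whole word.
    static Map<String, List<Integer>> findAll(String doc, Set<String> words) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // an empty search string would match at every index
            }
            int from = 0;
            int at;
            // indexOf(String, int) returns -1 once there are no more matches
            while ((at = doc.indexOf(word, from)) != -1) {
                int end = at + word.length();
                boolean startsWord = at == 0 || doc.charAt(at - 1) == ' ';
                boolean endsWord = end == doc.length() || doc.charAt(end) == ' ';
                if (startsWord && endsWord) {
                    positions.computeIfAbsent(word, k -> new ArrayList<>()).add(at);
                }
                from = at + 1; // resume the search just past the start of this match
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String doc = "one of the car and bike and one of those";
        Set<String> setW = new HashSet<>(Arrays.asList("and", "of", "one"));
        System.out.println(findAll(doc, setW));
    }
}
```

The result is grouped by word rather than ordered by position, since each word is searched for separately; the iteration order of a `HashMap` is unspecified.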

Code-Apprentice


Silly me! I should have known better. That works for me! I can accept your answer in 6 mins. – Dinero Jun 05 '19 at 23:22
@Dinero Note that what I suggest here requires more modifications to your code than I first thought. Rather than iterating over the words in `testDoc`, you should iterate over the words in `setW` then search for them in `doc`. This eliminates the need for `testDoc` entirely. – Code-Apprentice Jun 05 '19 at 23:24
@Dinero There is another way to use `testDoc` and calculate the index as you iterate. The first word in `testDoc` starts at index 0. The next word starts at index `testDoc[0].length() + 1`. And so on. This eliminates the need to call `indexOf()` at all. – Code-Apprentice Jun 05 '19 at 23:26
I agree with your second comment. The first comment, where you suggest I iterate over the words in setW, would mean I have to do multiple iterations. In the latter case I can find all the occurring words and their indexes in one scan. – Dinero Jun 05 '19 at 23:32
@Dinero Either way, both algorithms I describe require multiple iterations. By that I mean you have to iterate over both the words in `setW` and the words in `doc`. – Code-Apprentice Jun 06 '19 at 15:54
@Dinero Note how in both cases, I describe an algorithm **in words**. This is often a good way to start solving a problem before translating the steps into code. – Code-Apprentice Jun 06 '19 at 15:56
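
The running-offset approach described in the comments above could be sketched as follows. The class name `RunningOffset` and the method `scan` are hypothetical. One deliberate difference from the question: the sketch splits on a single space (`" "`) instead of `"\\s+"`, so that every separator accounts for exactly one character of offset. It therefore assumes the words are separated by space characters only, not tabs or newlines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RunningOffset {
    // Returns "word-->index" for every token of doc that is in words, in document order.
    static List<String> scan(String doc, Set<String> words) {
        List<String> hits = new ArrayList<>();
        int offset = 0; // starting index of the current token in doc
        for (String token : doc.split(" ")) {
            if (words.contains(token)) {
                hits.add(token + "-->" + offset);
            }
            // advance past this token and the single space that follows it
            offset += token.length() + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        String doc = "one of the car and bike and one of those";
        Set<String> setW = new HashSet<>(Arrays.asList("and", "of", "one"));
        for (String hit : scan(doc, setW)) {
            System.out.println(hit);
        }
    }
}
```

With the `doc` from the question this prints the six expected lines, from `one-->0` to `of-->32`, in a single pass over the tokens with one `HashSet` lookup per token.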

score 0 · Answer 2 · answered Jun 05 '19 at 23:53

Convert the list of words into a regex, and let the regex do the searching for you.

E.g. your 3 words would be a regex like this:

and|of|one

Of course, you wouldn't want partial words, so you'd add word boundary checks:

\b(and|of|one)\b

No need to capture the word (again), since the entire match is the word, so use a non-capturing group. You can also easily make the word search case-insensitive.

Although there will never be a problem with pure words (all letters), it's a good idea to guard the regex by quoting the words using Pattern.quote().

Example

String doc = "one of the car and bike and one of those";
String[] words = { "and", "of", "one" };

// Build regex
StringJoiner joiner = new StringJoiner("|", "\\b(?:", ")\\b");
for (String word : words)
    joiner.add(Pattern.quote(word));
String regex = joiner.toString();

// Find words
for (Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(doc); m.find(); )
    System.out.println(m.group() + "-->" + m.start());

Output

one-->0
of-->4
and-->15
and-->24
one-->28
of-->32

If you want to compress (obfuscate) the code a bit, you can write it as a single statement in Java 9+ (this needs a static import of java.util.stream.Collectors.joining):

Pattern.compile(Stream.of(words).collect(joining("|", "(?i)\\b(?:", ")\\b"))).matcher(doc).results().forEach(r -> System.out.println(r.group() + "-->" + r.start()));

Output is the same.
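
Since the question asks to store the words and indices rather than print them, here is a sketch of the same regex search collecting into a map. The class name `RegexIndexMap`, the method `index`, and the choice of `LinkedHashMap<String, List<Integer>>` are assumptions for illustration. The sketch matches case-sensitively so that the map keys are exactly the words searched for, and it assumes the word list is non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexIndexMap {
    // Maps each matched word to the list of its starting indices in doc.
    static Map<String, List<Integer>> index(String doc, Collection<String> words) {
        // Build \b(?:w1|w2|...)\b with each word quoted
        StringJoiner joiner = new StringJoiner("|", "\\b(?:", ")\\b");
        for (String word : words) {
            joiner.add(Pattern.quote(word));
        }
        // LinkedHashMap keeps the words in the order they first appear in doc
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        Matcher m = Pattern.compile(joiner.toString()).matcher(doc);
        while (m.find()) {
            positions.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String doc = "one of the car and bike and one of those";
        System.out.println(index(doc, Arrays.asList("and", "of", "one")));
        // prints {one=[0, 28], of=[4, 32], and=[15, 24]}
    }
}
```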

score 0 · Answer 3 · answered Jun 07 '19 at 17:27

There is another solution if you want fewer iterations: this code traverses the string only once, character by character. A StringBuilder accumulates the characters of the current word; when a whitespace is reached, the accumulated word is added to the answer list and its starting index is recorded in a parallel index list. Each character is visited once, so the time complexity of the scan is O(n) in the length of the string.

// Needs java.util.ArrayList and java.util.HashSet
StringBuilder sb = new StringBuilder();
ArrayList<String> answer = new ArrayList<>();  // every word found in doc
ArrayList<Integer> index = new ArrayList<>();  // starting index of each word
HashSet<String> setW = new HashSet<>();
setW.add("and");
setW.add("of");
setW.add("one");
String doc = "one of the car and bike and one of those";
// Go one position past the end so that the last word is flushed as well
for (int i = 0; i <= doc.length(); i++) {
    if (i == doc.length() || doc.charAt(i) == ' ') {
        if (sb.length() > 0) {
            answer.add(sb.toString());
            index.add(i - sb.length()); // the word started sb.length() characters back
            sb.setLength(0);
        }
    } else {
        sb.append(doc.charAt(i));
    }
}
// Print only the words that are in setW
for (int i = 0; i < answer.size(); i++) {
    if (setW.contains(answer.get(i))) {
        System.out.println(answer.get(i) + "-->" + index.get(i));
    }
}

I got the expected output with this approach, and I am posting it to offer another possible solution. (The answer list ends up holding every word in the string together with its index, not only the words that exist in setW. If you don't want that, you can filter the extra words out with an if(!setW.contains(answer.get(i))) condition.)

Output

one-->0
of-->4
and-->15
and-->24
one-->28
of-->32
