13

I want to split a string into tokens.

I borrowed code from another Stack Overflow question, "Equivalent to StringTokenizer with multiple characters delimiters", but I want to know whether this can be done with String methods only (.equals(), .startsWith(), etc.). I don't want to use regular expressions, the StringTokenizer class, Patterns, Matchers, or anything other than String.

For example, this is how I want to call the method:

    String[] delimiters = {" ", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
    String splitString[] = tokenizer(contents, delimiters);

And this is the code I took from the other question (I don't want to do it this way):

    private String[] tokenizer(String string, String[] delimiters) {
        // First, create a regular expression that matches the union of the
        // delimiters. When one delimiter contains another (for example && and
        // &), the longer one must come before the shorter one (&& before &),
        // or the regex parser will recognize && as two &.
        Arrays.sort(delimiters, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return -o1.compareTo(o2);
            }
        });
        // Build a string that will contain the regular expression
        StringBuilder regexpr = new StringBuilder();
        regexpr.append('(');
        for (String delim : delimiters) { // For each delimiter
            if (regexpr.length() != 1)
                regexpr.append('|'); // Add union separator if needed
            for (int i = 0; i < delim.length(); i++) {
                // Escape every character so that regex metacharacters are
                // taken literally (safe here because none of the delimiters
                // contain letters or digits)
                regexpr.append('\\');
                regexpr.append(delim.charAt(i));
            }
        }
        regexpr.append(')'); // Close the union
        Pattern p = Pattern.compile(regexpr.toString());

        // Now, search for the tokens
        List<String> res = new ArrayList<String>();
        Matcher m = p.matcher(string);
        int pos = 0;
        while (m.find()) { // While there's a delimiter in the string
            if (pos != m.start()) {
                // If there's something between the current and the previous
                // delimiter, add it to the tokens list
                res.add(string.substring(pos, m.start()));
            }
            res.add(m.group()); // add the delimiter
            pos = m.end(); // Remember end of delimiter
        }
        if (pos != string.length()) {
            // If any characters remain after the last delimiter,
            // add them to the token list
            res.add(string.substring(pos));
        }
        // Return the result
        return res.toArray(new String[res.size()]);
    }
    public static String[] clean(final String[] v) {
        List<String> list = new ArrayList<String>(Arrays.asList(v));
        list.removeAll(Collections.singleton(" "));
        return list.toArray(new String[list.size()]);
    }

Edit: I ONLY want to use string methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring

Aditya Ramkumar
  • Wow that's complicated. See my answer. https://en.wikipedia.org/wiki/KISS_principle – NickJ Oct 31 '15 at 17:36
  • @NickJ Haha, I wish I could make it easier. But this is for a project that I HAVE to do... – Aditya Ramkumar Oct 31 '15 at 21:26
  • Swatting flies with a sledgehammer – Jude Niroshan Nov 04 '15 at 04:50
  • You are using major regex – Mad Physicist Nov 06 '15 at 12:58
  • The specification is unclear, please provide a complete example with expected result. Delimiters like "=" and "==" or "-=" are ambiguous. Should t("a-==b", delimiters) return with [a,-,=,=,b] [a,-=,=,b] or [a,-,==,b] or whatever else? – tb- Nov 07 '15 at 00:32
  • From your example it looks like you're trying to do a lexical and syntax analysis of some language. If you want to do it properly - use proper tools. Look at something like that (you have other options): http://www.antlr.org. It will generate a proper parser from grammar description. – IceGlow Nov 10 '15 at 18:56

8 Answers

9

EDIT: My original answer did not quite do the trick: it did not include the delimiters in the resulting array, and it used the String.split() method, which was not allowed.

Here's my new solution, which is split into 2 methods:

/**
 * Splits the string at all specified literal delimiters, and includes the delimiters in the resulting array
 */
private static String[] tokenizer(String subject, String[] delimiters)  { 

    //Sort delimiters into length order, starting with longest
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
          return s2.length()-s1.length();
         }
      });

    //start with a list with only one string - the whole thing
    List<String> tokens = new ArrayList<String>();
    tokens.add(subject);

    //loop through the delimiters, splitting on each one
    for (int i=0; i<delimiters.length; i++) {
        tokens = splitStrings(tokens, delimiters, i);
    }

    return tokens.toArray(new String[] {});
}

/**
 * Splits each String in the subject at the delimiter
 */
private static List<String> splitStrings(List<String> subject, String[] delimiters, int delimiterIndex) {

    List<String> result = new ArrayList<String>();
    String delimiter = delimiters[delimiterIndex];

    //for each input string
    for (String part : subject) {

        int start = 0;

        //if this part equals one of the delimiters, don't split it up any more
        boolean alreadySplit = false;
        for (String testDelimiter : delimiters) {
            if (testDelimiter.equals(part)) {
                alreadySplit = true;
                break;
            }
        }

        if (!alreadySplit) {
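            for (int index=0; index<part.length(); index++) {
                String subPart = part.substring(index);
                // ignore empty delimiters, which would match at every position
                if (delimiter.length()>0 && subPart.indexOf(delimiter)==0) {
                    result.add(part.substring(start, index));   // part before delimiter
                    result.add(delimiter);                      // delimiter
                    start = index+delimiter.length();           // next part starts after delimiter
                    index = start-1;                            // skip the delimiter so overlapping matches (e.g. "==" in "===") are not found twice
                }
            }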
        }
        result.add(part.substring(start));                      // rest of string after last delimiter          
    }
    return result;
}
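
Note that Arrays.sort() reorders the caller's array in place. If that side effect is unwanted, a sketch like the following sorts a copy instead. The class name `DelimiterOrder` and the helper `longestFirst` are made up for this example; the comparator is the same one used above.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DelimiterOrder {
    // Returns a new array with the longest delimiters first.
    // The array passed in is left untouched.
    static String[] longestFirst(String[] delimiters) {
        String[] copy = delimiters.clone();
        Arrays.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String s1, String s2) {
                return s2.length() - s1.length();
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        String[] delimiters = {"=", "==", "+", "+="};
        // Arrays.sort on objects is stable, so equal lengths keep their relative order
        System.out.println(Arrays.toString(longestFirst(delimiters))); // [==, +=, =, +]
        System.out.println(Arrays.toString(delimiters));               // [=, ==, +, +=]
    }
}
```

Sorting longest first matters because a shorter delimiter such as "=" would otherwise split a longer one such as "==" into two tokens.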

Original Answer

I notice you are using Pattern when you said you only wanted to use String methods.

The approach I would take would be to think of the simplest way possible. I think that is to first replace all the possible delimiters with just one delimiter, and then do the split.

Here's the code:

private String[] tokenizer(String string, String[] delimiters)  {       

    //replace all specified delimiters with one
    for (String delimiter : delimiters) {
        while (string.indexOf(delimiter)!=-1) {
            string = string.replace(delimiter, "{split}");
        }
    }

    //now split at the new delimiter
    return string.split("\\{split\\}");

}

I use String.replace() rather than String.replaceAll() because replace() takes literal text while replaceAll() takes a regex argument, and the delimiters supplied are literal text.

Note that replace() already replaces every occurrence of its target, so the while loop is not strictly needed; it is harmless as long as "{split}" does not itself contain one of the delimiters.
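
To illustrate the difference, here is a small sketch (the class name `ReplaceDemo`, the helper `unify`, and the `#` marker are made up for this example). It shows that replace() works on literal text and substitutes every occurrence in a single call, while replaceAll() interprets its first argument as a regex.

```java
final class ReplaceDemo {
    // Replaces every literal occurrence of each delimiter with one marker.
    // Assumption: the marker does not occur in the input and does not
    // contain any of the delimiters.
    static String unify(String text, String[] delimiters, String marker) {
        for (String delimiter : delimiters) {
            // replace(CharSequence, CharSequence) is literal and replaces all occurrences
            text = text.replace(delimiter, marker);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(unify("a+b+c*d", new String[] {"+", "*"}, "#")); // a#b#c#d
        // "." is literal for replace() but matches any character for replaceAll()
        System.out.println("a.b".replace(".", "#"));    // a#b
        System.out.println("a.b".replaceAll(".", "#")); // ###
    }
}
```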

NickJ
  • Great! This is awesome. But how do I preserve the delimiters itself? I don't want to get rid of it. – Aditya Ramkumar Oct 31 '15 at 22:40
  • You still have the array of delimiters in your calling method – NickJ Nov 01 '15 at 16:40
  • No, I mean in the returned results. For example, If my delimiter was `{`, and a string was `ge{ab`, I would like an array with `ge`, `{` and `ab`. – Aditya Ramkumar Nov 03 '15 at 00:13
  • As @RealSkeptic says, using [split is using a regex](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-). – cheffe Nov 09 '15 at 07:02
  • I think my answer was given before the edit to the question listing acceptable methods, hence downvotes on that basis are a bit harsh. In the meantime, I am working on a new solution. It is an interesting challenge... – NickJ Nov 09 '15 at 09:15
  • @AdityaRamkumar I have edited my answer with your requirements. Seems to work, hopefully it's what you're after. – NickJ Nov 09 '15 at 17:14
3

Using only non-regex String methods... I used the startsWith(...) method, which wasn't in the list of methods you allowed, because it does a simple string comparison rather than a regex comparison.

The following implementation:

public static void main(String ... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String [] needles = new String [] { "def", "tuv" };
    String [] tokens = splitIntoTokensUsingNeedlesFoundInHaystack(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

private static String[] splitIntoTokensUsingNeedlesFoundInHaystack(String haystack, String[] needles) {
    List<String> list = new LinkedList<String>();
    StringBuilder builder = new StringBuilder();
    for(int haystackIndex = 0; haystackIndex < haystack.length(); haystackIndex++) {
        boolean foundAnyNeedle = false;
        String substring = haystack.substring(haystackIndex);
        for(int needleIndex = 0; (!foundAnyNeedle) && needleIndex < needles.length; needleIndex ++) {
            String needle = needles[needleIndex];
            if(substring.startsWith(needle)) {
                if(builder.length() > 0) {
                    list.add(builder.toString());
                    builder = new StringBuilder();
                }
                foundAnyNeedle = true;
                list.add(needle);
                haystackIndex += (needle.length() - 1);
            }
        }
        if( ! foundAnyNeedle) {
            builder.append(substring.charAt(0));
        }
    }
    if(builder.length() > 0) {
        list.add(builder.toString());
    }
    return list.toArray(new String[]{});
}

outputs

abc
def
ghijklmnopqrs
tuv
wxyz

Note: this code is demo-only. If one of the delimiters is an empty String, it will behave badly: it loops forever, consumes a lot of CPU, and eventually crashes with OutOfMemoryError: Java heap space.
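
One way to avoid that, sketched here under the assumption that empty (or null) delimiters should simply be ignored, is to filter the needles before tokenizing. The class name `DelimiterGuard` and the helper `withoutEmpty` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimiterGuard {
    // Returns a copy of the delimiters without null or empty entries.
    static String[] withoutEmpty(String[] delimiters) {
        List<String> kept = new ArrayList<String>();
        for (String delimiter : delimiters) {
            if (delimiter != null && delimiter.length() > 0) {
                kept.add(delimiter);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] needles = withoutEmpty(new String[] {"def", "", "tuv"});
        System.out.println(needles.length); // 2
    }
}
```

The filtered array can then be passed to the tokenizer in place of the original needles.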

Nathan
1

As far as I understand your problem, you can do something like this:

public Object[] tokenizer(String value, String[] delimeters){
    List<String> list= new ArrayList<String>();
    for(String s:delimeters){
        if(value.contains(s)){
            String[] strArr=value.split("\\"+s);
            for(String str:strArr){
                list.add(str);
                if(!list.contains(s)){
                    list.add(s);
                }
            }
        }
    }
    Object[] newValues=list.toArray();
    return newValues;
}

Now call this function in the main method:

String[] delimeters = {" ", "{", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
Object[] obj=st.tokenizer("ge{ab", delimeters); //st is a reference to the class containing tokenizer; adjust this for your own code
for(Object o:obj){
    System.out.println(o.toString());
}
Aritro Sen
  • I thought you wanted to use String methods only. So split() and contains() both are String methods. (Here I have used contains() method of List and String.) – Aritro Sen Nov 05 '15 at 05:15
  • Your line `String[] strArr=value.split("\\"+s);` might not work - there's no guarantee that that `"\\"+s` will be a valid regex, it depends on s. It could easily fail. – NickJ Nov 05 '15 at 09:38
1

Suggestion:

  private static int INIT_INDEX_MAX_INT = Integer.MAX_VALUE;

  private static String[] tokenizer(final String string, final String[] delimiters) {
    final List<String> result = new ArrayList<>();

    int currentPosition = 0;
    while (currentPosition < string.length()) {
      // plan: search for the nearest delimiter and its position
      String nextDelimiter = "";
      int positionIndex = INIT_INDEX_MAX_INT;
      for (final String currentDelimiter : delimiters) {
        final int currentPositionIndex = string.indexOf(currentDelimiter, currentPosition);
        if (currentPositionIndex < 0) { // current delimiter not found, go to the next
          continue;
        }
        if (currentPositionIndex < positionIndex) { // we found a better one, update
          positionIndex = currentPositionIndex;
          nextDelimiter = currentDelimiter;
        }
      }
      if (positionIndex == INIT_INDEX_MAX_INT) { // we found nothing, finish up
        final String finalPart = string.substring(currentPosition, string.length());
        result.add(finalPart);
        break;
      }
      // we have one, add substring + delimiter to result and update current position
      // System.out.println(positionIndex + ":[" + nextDelimiter + "]"); // to follow the internals
      final String stringBeforeNextDelimiter = string.substring(currentPosition, positionIndex);
      result.add(stringBeforeNextDelimiter);
      result.add(nextDelimiter);
      currentPosition += stringBeforeNextDelimiter.length() + nextDelimiter.length();
    }

    return result.toArray(new String[] {});
  }

Notes:

  • I have added more comments than necessary. I guess it would help in this case.
  • The performance of this is quite bad (it could be improved with tree structures and hashes). Performance was not part of the specification.
  • Operator precedence is not specified (see my comment on the question). When two delimiters are found at the same position, the one listed first in the array wins.
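
If the longest delimiter should win when several start at the same position (an assumption, since the question does not say), the search step could break ties by length. Below is a minimal sketch using only indexOf(), length() and substring(). The class and method names are hypothetical, and unlike the code above it skips empty delimiters and does not emit empty strings between adjacent delimiters, which is a further assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LongestMatchTokenizer {
    static String[] tokenize(String text, String[] delimiters) {
        List<String> result = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int bestIndex = -1;
            String bestDelimiter = null;
            for (String d : delimiters) {
                if (d.length() == 0) continue; // an empty delimiter would never advance
                int idx = text.indexOf(d, pos);
                if (idx < 0) continue;         // this delimiter does not occur any more
                boolean nearer = bestIndex < 0 || idx < bestIndex;
                boolean longerAtSameSpot = idx == bestIndex && d.length() > bestDelimiter.length();
                if (nearer || longerAtSameSpot) {
                    bestIndex = idx;
                    bestDelimiter = d;
                }
            }
            if (bestDelimiter == null) {       // no delimiter left: the rest is one token
                result.add(text.substring(pos));
                break;
            }
            if (bestIndex > pos) {             // text before the delimiter, if any
                result.add(text.substring(pos, bestIndex));
            }
            result.add(bestDelimiter);
            pos = bestIndex + bestDelimiter.length();
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] delimiters = {"=", "==", "-", "-="};
        System.out.println(Arrays.toString(tokenize("a-==b", delimiters))); // [a, -=, =, b]
    }
}
```

With this tie-break, the order of the delimiters in the array no longer affects the result.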

I ONLY want to use string methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring

Check. The function uses only indexOf(), length() and substring()

No, I mean in the returned results. For example, If my delimiter was {, and a string was ge{ab, I would like an array with ge, { and ab

Check:

  private static void test() {
    final String[] delimiters = { "{" };
    final String contents = "ge{ab";
    final String splitString[] = tokenizer(contents, delimiters);
    final String joined = String.join("", splitString);
    System.out.println(Arrays.toString(splitString));
    System.out.println(contents.equals(joined) ? "ok" : "wrong: [" + contents + "]#[" + joined + "]");
  }
  // [ge, {, ab]
  // ok

One final remark: I would advise reading about compiler construction, in particular the compiler front end, for best practices on this kind of problem.

tb-
1

Maybe I haven't fully understood the question, but I have the impression that you want to rewrite the Java String method split(). I would advise you to have a look at this function, see how it's done and start from there.

Dominique
1

Honestly, you could use Apache Commons Lang. If you check the library's source code, you will notice that it doesn't use regex. The method [StringUtils.split](http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#split(java.lang.String, java.lang.String)) uses only String operations and a number of flags.

Anyway, here is some code that uses Apache Commons Lang.

import org.apache.commons.lang.StringUtils;
import org.junit.Assert;
import org.junit.Test;

public class SimpleTest {

    @Test
    public void testSplitWithoutRegex() {
        String[] delimiters = {"==", "+=", "++", "-=", "--", "/=", "*=", "/**", "*/",
            " ", "=", "+", "-", "/", "*", "(", ")", ";", "\t", "\n"};

        String finalDelimiter = "#";

        //check if delimiter can be used
        boolean canBeUsed = true;

        for (String delimiter : delimiters) {
            if (finalDelimiter.equals(delimiter)) {
                canBeUsed = false;
                break;
            }
        }

        if (!canBeUsed) {
            Assert.fail("The selected delimiter can't be used.");
        }

        String s = "Assuming that we have /** or /* all these signals like == and; / or * will be replaced.";
        System.out.println(s);

        for (String delimiter : delimiters) {
            while (s.indexOf(delimiter) != -1) {
                s = s.replace(delimiter, finalDelimiter);
            }
        }

        String[] splitted = StringUtils.split(s, "#");

        for (String s1 : splitted) {
            System.out.println(s1);
        }

    }
}

I hope it helps.

josivan
1

As simple as I could get it...

public class StringTokenizer {
    public static String[] split(String s, String[] tokens) {
        Arrays.sort(tokens, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return o2.length()-o1.length();
            }
        });

        LinkedList<String> result = new LinkedList<>();

        int j=0;
        for (int i=0; i<s.length(); i++) {
            String ss = s.substring(i);

            for (String token : tokens) {
                if (ss.startsWith(token)) {
                    if (i>j) {
                        result.add(s.substring(j, i));
                    }

                    result.add(token);

                    j = i+token.length();
                    i = j-1;

                    break;
                }
            }
        }

        result.add(s.substring(j));

        return result.toArray(new String[result.size()]);
    }
}

It creates a lot of new objects, and could be optimized by writing a custom startsWith() implementation that compares the string char by char.
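
A minimal sketch of such a helper, which compares in place instead of creating a substring for every position. The class name `PrefixCheck` and the method name `startsWithAt` are hypothetical; it uses only charAt() and length().

```java
final class PrefixCheck {
    // True if token occurs in s starting at index offset.
    static boolean startsWithAt(String s, String token, int offset) {
        if (offset < 0 || offset + token.length() > s.length()) {
            return false; // token would run past the end of s
        }
        for (int k = 0; k < token.length(); k++) {
            if (s.charAt(offset + k) != token.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(startsWithAt("this==is", "==", 4)); // true
        System.out.println(startsWithAt("this==is", "==", 5)); // false
    }
}
```

In the loop above, `ss.startsWith(token)` could then become `startsWithAt(s, token, i)`, which makes the `ss` substring unnecessary. If the built-in methods are allowed, String.startsWith(String prefix, int toffset) does the same job.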

@Test
public void test() {
    String[] split = StringTokenizer.split("this==is the most>complext<=string<<ever", new String[] {"=", "<", ">", "==", ">=", "<="});

    assertArrayEquals(new String[] {"this", "==", "is the most", ">", "complext", "<=", "string", "<", "<", "ever"}, split);
}

passes fine :)

Grogi
1

You can use recursion (a hallmark of functional programming) to make it less verbose.

public static String[] tokenizer(String text, String[] delims) {
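    // find the delimiter that occurs earliest in the text, so that the head
    // never contains an unsplit delimiter
    int first = -1;
    String firstDelim = null;
    for(String delim : delims) {
        int i = text.indexOf(delim);
        if(i >= 0 && (first < 0 || i < first)) {
            first = i;
            firstDelim = delim;
        }
    }

    if(firstDelim == null) {
        return new String[] { text };
    }

    // recursive call
    String[] tail = tokenizer(text.substring(first + firstDelim.length()), delims);

    // return [ head, middle, tail.. ]
    String[] list = new String[tail.length + 2];
    list[0] = text.substring(0, first);
    list[1] = firstDelim;
    System.arraycopy(tail, 0, list, 2, tail.length);
    return list;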
}

Tested it using the same unit-test from the other answer

public static void main(String ... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String [] needles = new String [] { "def", "tuv" };
    String [] tokens = tokenizer(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

Output

abc
def
ghijklmnopqrs
tuv
wxyz

It would be a little more elegant if Java had better native array support.

Alex R