Java Tokenization: Treat Anything Separated by an Underscore as One Word

Question

I have a very simple tokenizer using StreamTokenizer that converts mathematical expressions into their individual components (below). The problem is that if the expression contains a variable called T_1, it is split into [T, _, 1], whereas I would like it returned as [T_1].

I have attempted to use a variable to track whether the last character was an underscore and, if so, append to the element at list.size()-1, but that seems like a very clunky and inefficient solution. Is there a better way to do this? Thanks!

        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/'); // Don't parse slash as part of numbers.
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
        {
            switch (tokenizer.ttype) //Switch based on the type of token
            {
            case StreamTokenizer.TT_NUMBER: //Number
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD: //Word
                tokBuf.add(tokenizer.sval);
                break;
            case '_':
                tokBuf.add(tokBuf.size()-1, tokenizer.sval);
                break;
            default: //Operator
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
            }
        }

        return tokBuf;

I'm not seeing what you're seeing. If I pass in `T_1`, I get this as output: `[null, T, 1.0]` — Daniel Kaplan, Sep 26 '14 at 18:26
I feel like `wordChars` is somehow related to the answer, but I can't figure out how to *add* word chars. Seems like you can only set a range. Surprisingly poor documentation and API for a Java class, IMO. Is there a legitimate reason you're using a `StreamTokenizer` over a `StringTokenizer`? — Daniel Kaplan, Sep 26 '14 at 18:31
I am really sorry, I provided code that I had not completely fixed. The code above should not include the case for '_'. That was a relic of my attempts to add it on to the last element in the list. And no, there is no legitimate reason that I am using StreamTokenizer. Do you feel that StringTokenizer is superior? — Archetype90, Sep 26 '14 at 18:33
Not necessarily. It's about using the right tool for the job. See how it works, it may be a better fit: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html — Daniel Kaplan, Sep 26 '14 at 18:41
Well said. That may be a better option if I cannot figure out how to stop StreamTokenizer from splitting at an underscore, but it may also require a large set of delimiters because of the number of operators. — Archetype90, Sep 26 '14 at 18:45

Will Hartung · Accepted Answer · 2014-09-26T19:09:09.597

This is what you want.

tokenizer.wordChars('_', '_');

This makes the _ recognizable as part of a word.

Addenda:

This builds and runs:

public static void main(String args[]) throws Exception {
    String s = "abc_xyz abc 123 1 + 1";
    StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
    tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
    tokenizer.ordinaryChar('/'); // Don't parse slash as part of numbers.
    tokenizer.wordChars('_', '_'); // Treat underscore as part of a word.


    List<String> tokBuf = new ArrayList<String>();
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
    {
        switch (tokenizer.ttype) //Switch based on the type of token
        {
        case StreamTokenizer.TT_NUMBER: //Number
            tokBuf.add(String.valueOf(tokenizer.nval));
            break;
        case StreamTokenizer.TT_WORD: //Word
            tokBuf.add(tokenizer.sval);
            break;
        default: //Operator
            tokBuf.add(String.valueOf((char) tokenizer.ttype));
        }
    }
    System.out.println(tokBuf);
}

run:
[abc_xyz, abc, 123.0, 1.0, +, 1.0]

Excellent! This actually did it for me. I just put that line right under the tokenizer.ordinaryChar calls. — Archetype90, Sep 26 '14 at 19:12
For others, I tweaked this to show that you can call `wordChars` multiple times and it considers each call, not just the most recent. It is kind of unusual how this is not documented in the javadoc. — Daniel Kaplan, Sep 26 '14 at 19:13
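
To illustrate the comment above about repeated `wordChars` calls, here is a minimal sketch. The class name `UnderscoreTokens`, the `tokenize` helper, and the choice of `$` as a second word character are assumptions made for illustration; they are not part of the original answer.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class UnderscoreTokens {

    // Hypothetical helper: tokenizes an expression, keeping '_' and '$' inside words.
    static List<String> tokenize(String expression) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(expression));
        tokenizer.ordinaryChar('-');   // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/');   // Don't treat slash as a comment character.
        tokenizer.wordChars('_', '_'); // First call: underscore becomes a word character.
        tokenizer.wordChars('$', '$'); // Second call adds '$' without undoing the first.

        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    tokens.add(String.valueOf(tokenizer.nval));
                    break;
                case StreamTokenizer.TT_WORD:
                    tokens.add(tokenizer.sval);
                    break;
                default: // Single-character operator
                    tokens.add(String.valueOf((char) tokenizer.ttype));
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("T_1 + x$2 - 3")); // prints [T_1, +, x$2, -, 3.0]
    }
}
```

Both `T_1` and `x$2` come back as single words, which suggests each `wordChars` call adds to the set of word characters rather than replacing it.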

Daniel Kaplan · Answer 2 · 2014-09-26T19:13:45.320

A StringTokenizer may be a better fit. If so, here's how you use it:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Solution {

    public static void main(String args[]) throws Exception {
        StringTokenizer tokenizer = new StringTokenizer("T_1 1 * bar");
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.hasMoreTokens()) //While there are more tokens
        {
            tokBuf.add(tokenizer.nextToken());
        }

        System.out.println(tokBuf);
    }
}

This printed out:

[T_1, 1, *, bar]
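
Note that the example above splits only on whitespace, so an input such as `T_1+1` would come back as a single token. One possible way around that, sketched below, is the three-argument `StringTokenizer` constructor with `returnDelims` set to `true`, so the operators are returned as tokens too. The class name `DelimiterTokens`, the `tokenize` helper, and the particular delimiter set are assumptions for illustration; adjust the delimiters to the operators your expressions actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterTokens {

    // Assumed operator set; whitespace is included so it also separates tokens.
    static final String DELIMITERS = " \t+-*/()^";

    // Hypothetical helper: splits on the delimiters and keeps the operators as tokens.
    static List<String> tokenize(String expression) {
        // returnDelims = true makes each delimiter character come back as its own token.
        StringTokenizer tokenizer = new StringTokenizer(expression, DELIMITERS, true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (!token.trim().isEmpty()) { // Drop the whitespace delimiters.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("T_1+1*(bar-2)")); // prints [T_1, +, 1, *, (, bar, -, 2, )]
    }
}
```

Because `_` is not in the delimiter set, `T_1` stays in one piece. Unlike `StreamTokenizer`, this approach returns numbers as plain strings and cannot recognize multi-character operators.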
