2

I have a very simple tokenizer built on StreamTokenizer that splits mathematical expressions into their individual components (code below). The problem is that a variable named T_1 in the expression is split into [T, _, 1], whereas I would like it returned as the single token [T_1].

I tried tracking whether the last character was an underscore and, if so, appending it to the element at list.size()-1, but that seems like a very clunky and inefficient solution. Is there a better way to do this? Thanks!

        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/'); // Don't parse slash as part of numbers.
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
        {
            switch (tokenizer.ttype) //Switch based on the type of token
            {
            case StreamTokenizer.TT_NUMBER: //Number
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD: //Word
                tokBuf.add(tokenizer.sval);
                break;
            case '_':
                tokBuf.add(tokBuf.size()-1, tokenizer.sval);
                break;
            default: //Operator
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
            }
        }

        return tokBuf;
Suvasis
Archetype90
  • I'm not seeing what you're seeing. If I pass in `T_1`, I get this as output: `[null, T, 1.0]` – Daniel Kaplan Sep 26 '14 at 18:26
  • I feel like `wordChars` is somehow related to the answer, but I can't figure out how to *add* word chars. Seems like you can only set a range. Surprisingly poor documentation and API for a Java class, IMO. Is there a legitimate reason you're using a `StreamTokenizer` over a `StringTokenizer`? – Daniel Kaplan Sep 26 '14 at 18:31
  • I am really sorry, I provided code that I had not completely fixed. The code above should not include the case for '_'. That was a relic of my attempts to add it on to the last element in the list. And no, there is no legitimate reason that I am using StreamTokenizer. Do you feel that StringTokenizer is superior? – Archetype90 Sep 26 '14 at 18:33
  • Not necessarily. It's about using the right tool for the job. See how it works, it may be a better fit: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html – Daniel Kaplan Sep 26 '14 at 18:41
  • Well said. That may be a better option if I cannot figure out how to stop StreamTokenizer from splitting at an underscore, but it may also require a large set of delimiters because of the number of operators. – Archetype90 Sep 26 '14 at 18:45

2 Answers

4

This is what you want.

tokenizer.wordChars('_', '_');

This makes the _ recognizable as part of a word.

Addenda:

This builds and runs:

public static void main(String args[]) throws Exception {
    String s = "abc_xyz abc 123 1 + 1";
    StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
    tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
    tokenizer.ordinaryChar('/'); // Don't parse slash as part of numbers.
    tokenizer.wordChars('_', '_'); // Treat underscore as a word character.


    List<String> tokBuf = new ArrayList<String>();
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
    {
        switch (tokenizer.ttype) //Switch based on the type of token
        {
        case StreamTokenizer.TT_NUMBER: //Number
            tokBuf.add(String.valueOf(tokenizer.nval));
            break;
        case StreamTokenizer.TT_WORD: //Word
            tokBuf.add(tokenizer.sval);
            break;
        default: //Operator
            tokBuf.add(String.valueOf((char) tokenizer.ttype));
        }
    }
    System.out.println(tokBuf);
}

run:
[abc_xyz, abc, 123.0, 1.0, +, 1.0]
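For the `T_1` case from the question specifically, here is a minimal sketch of the same loop wrapped in a helper. The class name `WordCharsSketch`, the method `tokenize`, and the extra `$` word character are illustrative assumptions, not part of the original code. It also shows that repeated `wordChars` calls accumulate rather than replace one another, so letters, `_`, and `$` all count as word characters here.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class WordCharsSketch {

    static List<String> tokenize(String expr) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(expr));
        t.ordinaryChar('-');   // Don't parse minus as part of numbers.
        t.ordinaryChar('/');   // Don't treat slash as a comment character.
        t.wordChars('_', '_'); // Underscore becomes a word character.
        t.wordChars('$', '$'); // Assumed extra character; adds to the set above.

        List<String> out = new ArrayList<>();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                switch (t.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    out.add(String.valueOf(t.nval));
                    break;
                case StreamTokenizer.TT_WORD:
                    out.add(t.sval);
                    break;
                default: // Single-character operator
                    out.add(String.valueOf((char) t.ttype));
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [T_1, +, x$2, *, 3.0, -, 4.0, /, 2.0]
        System.out.println(tokenize("T_1 + x$2 * 3 - 4/2"));
    }
}
```

Digits after the underscore stay attached because, once a word has started, the tokenizer keeps consuming both word and number characters.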
Will Hartung
  • Excellent! This actually did it for me. I just put that line right under the tokenizer.ordinaryChar calls. – Archetype90 Sep 26 '14 at 19:12
  • For others, I tweaked this to show that you can call `wordChars` multiple times and it considers each call, not just the most recent. It is kind of unusual how this is not documented in the javadoc. – Daniel Kaplan Sep 26 '14 at 19:13
0

A StringTokenizer may be a better fit. If so, here's how you use it:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Solution {

    public static void main(String args[]) throws Exception {
        StringTokenizer tokenizer = new StringTokenizer("T_1 1 * bar");
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.hasMoreTokens()) //While tokens remain
        {
            tokBuf.add(tokenizer.nextToken());
        }

        System.out.println(tokBuf);
    }
}

This printed out:

[T_1, 1, *, bar]
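Note that the one-argument constructor splits only on whitespace, so an expression written without spaces, such as `T_1+2`, would come back as a single token. A possible workaround, sketched below, is to pass the operators as delimiters and set `returnDelims` to `true` so they are returned as tokens too. The class name `DelimiterSketch` and the particular operator set in `DELIMS` are assumptions for illustration; a real expression grammar would likely need more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, for illustration only.
class DelimiterSketch {

    // Assumed delimiter set: whitespace plus a few single-character operators.
    static final String DELIMS = " \t+-*/()";

    static List<String> tokenize(String expr) {
        // returnDelims = true makes each delimiter come back as its own token.
        StringTokenizer t = new StringTokenizer(expr, DELIMS, true);
        List<String> out = new ArrayList<>();
        while (t.hasMoreTokens()) {
            String tok = t.nextToken();
            if (!tok.trim().isEmpty()) { // Drop whitespace delimiters.
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [T_1, +, 2, *, (, x, -, 3, )]
        System.out.println(tokenize("T_1+2*(x-3)"));
    }
}
```

Unlike StreamTokenizer, this leaves numbers as plain strings (`2`, not `2.0`) and cannot handle multi-character operators without extra work.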
Daniel Kaplan