Scanning a number and returning the lexeme in the input stream- Java?

Question

I am trying to write a method that will scan the input and return a String representing the lexeme found in the input string.

This is what I have so far but I don't know if I'm going in the right direction-- all help would be appreciated :)

private String scanNumbers(char input)
{
   String result= "";
   int value = in.read()
   if(value != -1)
   {
      If(isDigit(input))
       {
         result = Integer.toString(value);
        }
   }
 return result;
}

public static boolean isDigit(char input)
{
    return (input >= '0' && input <= '9');
}

Thank you I am new to parsing/lexemes/compilers.

I wouldn't guess that a lexeme consists solely of decimal digits. One thing to be careful of is that `System.in.read()` returns the next byte in the stream, not necessarily the next character. Characters can consist of multiple bytes. — markspace, Aug 30 '14 at 01:21
Use an [InputStreamReader.](http://docs.oracle.com/javase/tutorial/i18n/text/stream.html) — markspace, Aug 30 '14 at 02:09
"in" in "in.read()" is an input stream reader-- so this should work? — Surz, Aug 30 '14 at 04:21

Brian Tompsett - 汤莱恩 · Answer 1 · 2020-10-08T18:30:09.210

Introduction

Questions that appear to be related to a homework exercise are often slow to be answered on SO. We often wait until the deadline has well passed!

You mention you are new to the topics of parsing/lexemes/compilers, and want some help in writing a Java method to scan the input and return a string representing the lexeme found in the input string. Later you clarify, indicating that you want a method that skips characters until it finds digits.

There is quite a bit of confusion in your question which produces conflicts in what you want to achieve.

It is not clear whether you want to learn about performing lexical analysis in Java as part of a larger compiler project, whether you only want to do it with numbers, or whether you are looking for existing tools or methods that do this rather than trying to learn how to program such methods yourself. If you are programming it yourself, it is also unclear whether you only need to know about reading a number, or whether this is just an example of the kind of thing you want to do.

Lexical Analysis

Lexical analysis, which is also known as scanning, is the process of reading a corpus of text which is composed of characters. This can be done for several purposes, such as data input, linguistic analysis of written material (such as word frequency counting) or as part of language compilation or interpretation. When done as part of compilation it is one (and usually the first) of a sequence of phases that include parsing, semantic analysis, code generation, optimisation and such. When writing compilers, generator tools are usually used, so to write a compiler in Java one would often use a Java lexer generator and a Java parser generator to create the Java code for those compiler components. Sometimes the lexer and parser are hand written, but that is not a recommended task for a novice. It would require a compiler writing specialist to build a compiler by hand better than a tool-set. Sometimes, as a class exercise, students are asked to write code to perform a piece of lexical analysis to help them understand the process, but this is often for just a few lexemes, like your digit exercise.

The term lexeme is used to describe a sequence of characters that compose an individual entity recognised by a lexical analyser. Once recognised it is usually represented by a token. The lexeme is therefore replaced by a token as part of the lexical analysis process. A lexical analyser will sometimes record the lexeme in a symbol table for later use before replacing it by the token. This is how identifiers in programs are often recorded in a compiler.
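To make the lexeme/token distinction concrete, here is a minimal sketch. The names LexemeDemo, Kind, Token, symbols and makeToken are all invented for this illustration and do not come from any library; a real compiler would have many more token kinds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LexemeDemo {
    // Hypothetical token kinds, for this illustration only
    enum Kind { NUMBER, IDENTIFIER }

    // A token pairs the kind the parser cares about with the lexeme that produced it
    static final class Token {
        final Kind kind;
        final String lexeme;

        Token(Kind kind, String lexeme) {
            this.kind = kind;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return kind + "(" + lexeme + ")";
        }
    }

    // Symbol table: maps each identifier lexeme to a small integer index
    static final Map<String, Integer> symbols = new LinkedHashMap<>();

    // Classifies an already-isolated lexeme (assumed non-null)
    static Token makeToken(String lexeme) {
        boolean allDigits = !lexeme.isEmpty()
                && lexeme.chars().allMatch(c -> c >= '0' && c <= '9');
        if (allDigits) {
            return new Token(Kind.NUMBER, lexeme);
        }
        symbols.putIfAbsent(lexeme, symbols.size());  // record each identifier once
        return new Token(Kind.IDENTIFIER, lexeme);
    }

    public static void main(String[] args) {
        System.out.println(makeToken("42"));     // NUMBER(42)
        System.out.println(makeToken("count"));  // IDENTIFIER(count)
        System.out.println(symbols);             // {count=0}
    }
}
```

The parser would only look at the kind, while the lexeme text stays available for later phases.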

There are several tools for building lexers in Java. Two of the most common are JLex and JFlex. To illustrate how they work, to recognise an integer whilst skipping whitespace, we would use the following rules:

%%
WHITE_SPACE_CHAR=[\n\ \t\b\012]
DIGIT=[0-9]
%%
{WHITE_SPACE_CHAR}+  { }
{DIGIT}+   { return(new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length())); }
%%

which would be processed by the tool to produce Java methods to achieve that task.

The notations used to describe the lexemes are usually written as regular expressions. Computer Science theory can help us with the programming of a lexical analyser: a regular expression can be represented by a finite state automaton. There is a particular style of coding for matching lexemes that experienced programmers would recognise and use in this situation, which involves a switch inside a loop:

while ( ! eof ) {
  switch ( next_symbol() ) {

  case symbol:
      ...
  break;
  default:
        error(diagnostic); break;
  }
 }

It is often these concepts that a simple lexical programming exercise is intended to introduce to students.
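As a rough sketch of that switch-inside-a-loop style applied to the digit exercise, the following hand-written automaton switches on its current state rather than on the symbol. The class, method and state names are invented for this illustration, and it reads from a String rather than a stream to keep it short.

```java
public class DigitAutomaton {
    // States of a small hand-written automaton (names are illustrative)
    enum State { SKIPPING, IN_NUMBER, DONE }

    // Returns the first run of decimal digits in the input, or "" if there is none
    public static String firstNumber(String input) {
        StringBuilder lexeme = new StringBuilder();
        State state = State.SKIPPING;
        int pos = 0;
        while (state != State.DONE && pos < input.length()) {
            char c = input.charAt(pos);
            boolean digit = (c >= '0' && c <= '9');
            switch (state) {
                case SKIPPING:
                    if (digit) {
                        state = State.IN_NUMBER;  // re-examine c in the new state
                    } else {
                        pos++;                    // discard a non-digit
                    }
                    break;
                case IN_NUMBER:
                    if (digit) {
                        lexeme.append(c);         // extend the lexeme
                        pos++;
                    } else {
                        state = State.DONE;       // first non-digit ends the lexeme
                    }
                    break;
                default:
                    throw new IllegalStateException("unexpected state " + state);
            }
        }
        return lexeme.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("abc 123 def"));  // 123
    }
}
```

Each case corresponds to one state of the automaton, and each branch to one transition, which is why the pattern maps so directly onto regular expressions.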

Tokenizing in Java

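With all those preliminary explanations out of the way, let's get down to your piece of Java code. As mentioned in the comments, there is a difference in Java between reading bytes from an input stream and reading characters: the read() of an InputStream returns one byte, whereas the read() of a Reader returns one character (a 16-bit UTF-16 code unit), and a single character may be encoded as several bytes in the stream. If in is a raw byte stream, you have used a byte read within a character processing method; if it is an InputStreamReader, as you say in the comments, read() does return a character. Either way, Integer.toString(value) produces the numeric character code (for example "53" for the digit 5) rather than the digit itself, and your test is applied to the parameter input rather than to the character you have just read.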
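The byte/character difference can be seen with a small experiment. This is only a sketch: it assumes the input is UTF-8 encoded, and the class and method names are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BytesVersusChars {
    // First value returned by a raw byte stream over the UTF-8 encoding of text
    public static int firstByte(String text) {
        ByteArrayInputStream bytes =
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        return bytes.read();
    }

    // First value returned by a Reader that decodes the same bytes as UTF-8
    public static int firstChar(String text) {
        Reader chars = new InputStreamReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.UTF_8);
        try {
            return chars.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00e9";               // e with acute accent, two bytes in UTF-8
        System.out.println(firstByte(text));  // 195, only the first of the two bytes
        System.out.println(firstChar(text));  // 233, the whole character
        System.out.println(firstChar("7"));   // 55, the character code of the digit 7
    }
}
```

For plain ASCII digits the two reads happen to agree, which is why the problem often goes unnoticed in exercises like this one.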

Recognising simple tokens in an input stream, particularly for data entry, is such a common activity that Java has a built-in class for it, java.io.StreamTokenizer.

We could implement your task in the following way, for example:

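    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.io.StreamTokenizer;

    public class TokenizeInput {
        public static void main(String[] args) throws IOException {
            // create a new tokenizer
            Reader r = new BufferedReader(new InputStreamReader(System.in));
            StreamTokenizer st = new StreamTokenizer(r);
            st.eolIsSignificant(true);   // without this, TT_EOL is never returned

            // print the stream tokens until end of file or a '!' is seen
            boolean eof = false;
            do {
                int token = st.nextToken();
                switch (token) {
                    case StreamTokenizer.TT_EOF:
                        System.out.println("End of File encountered.");
                        eof = true;
                        break;
                    case StreamTokenizer.TT_EOL:
                        System.out.println("End of Line encountered.");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        System.out.println("Number: " + st.nval);
                        break;
                    case StreamTokenizer.TT_WORD:
                        System.out.println("Word: " + st.sval);
                        break;
                    default:
                        // any other single character is returned as its own token
                        System.out.println((char) token + " encountered.");
                        if (token == '!') {
                            eof = true;
                        }
                }
            } while (!eof);
        }
    }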

However, this does not return the lexeme of a number as a string: StreamTokenizer only matches the number and stores its value as a double in nval.
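If the text of the number is wanted, one possible workaround is to switch off number parsing and declare the digits to be word characters, so that the lexeme arrives in sval. This is a sketch under assumptions: the class and method names are invented, and the input is assumed to contain only ASCII or Latin-1 characters, because StreamTokenizer treats every character above 255 as a word character.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class NumberLexemes {
    // Returns the first run of digits as text, or "" if the input has none.
    // Assumes ASCII/Latin-1 input (see the note above about characters > 255).
    public static String firstNumber(Reader source) {
        StreamTokenizer st = new StreamTokenizer(source);
        st.resetSyntax();            // every character becomes ordinary; number parsing is off
        st.whitespaceChars(0, ' ');  // skip control characters and spaces
        st.wordChars('0', '9');      // digit runs now come back as TT_WORD with text in sval
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_WORD) {
                    return st.sval;  // the lexeme exactly as written
                }
                // any other character is an ordinary one-character token; ignore it
            }
            return "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstNumber(new StringReader("abc 007 def")));  // 007
    }
}
```

Unlike nval, the string keeps details such as leading zeros.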

I see you have noticed the Java class java.util.Scanner, because your question had that as a tag. This is another class that can perform similar operations. We could read an integer from the input like this (note that nextInt() returns the parsed int value rather than the lexeme text):

Scanner s = new Scanner(System.in);
System.out.println(s.nextInt());
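Scanner can also hand back the matched text itself. The following sketch uses findInLine, which ignores delimiters and only searches the current line; the class and method names are invented, and a String is used as the source so the example is easy to try.

```java
import java.util.Scanner;

public class ScannerLexeme {
    // Returns the first run of digits on the first line of input, or null if none is found
    public static String firstNumber(String input) {
        try (Scanner s = new Scanner(input)) {
            return s.findInLine("[0-9]+");  // the matched lexeme as a String
        }
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("abc 0123 def"));  // 0123
    }
}
```

Replacing the String argument with System.in in the Scanner constructor would read from standard input instead.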

Solution

Finally, let's rewrite your original code to find the lexeme for an integer, skipping over any unwanted leading characters (here, whitespace). I use Java regular expression matching to decide which characters to skip.

import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public class ReadNumbers {
    static InputStreamReader in = null;            // Have input source as a global
    static int value = -1;                         // and the current input value

    public static void main(String[] args) {
        try {
            in = new InputStreamReader(System.in); // Set up the input
            value = in.read();                     // pre-fill the input state
            System.out.println(scanNumbers());
        } catch (IOException e) {
            e.printStackTrace();                   // print error
        }
    }

    private static String scanNumbers() throws IOException {
        String skipCharacters = "\\s";             // Characters that can be skipped
        StringBuilder result = new StringBuilder(); // collects the lexeme
        // Skip optional characters before the number
        while ((value != -1) && Pattern.matches(skipCharacters, "" + (char) value)) {
            value = in.read();                     // pre-load the next character
        }
        // Now collect the digits of the number
        while ((value != -1) && isDigit((char) value)) {
            result.append((char) value);           // append digit character to result
            value = in.read();                     // pre-load the next character
        }
        return result.toString();
    }

    public static boolean isDigit(char input) {
        return (input >= '0' && input <= '9');
    }
}

Afterword

The comment from @markspace is interesting and useful, as it points out that not all numbers are solely decimal digits. Consider numbers in other bases, like hexadecimal. Java allows integer constants to be specified in such bases, which do not use just the digits 0..9.
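As a rough illustration of that point, the sketch below recognises either a hexadecimal or a decimal integer lexeme with one regular expression. The pattern is deliberately simplified (no octal or binary prefixes, no underscores, no L suffix), and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IntegerLexemes {
    // Simplified pattern: the hexadecimal form (0x...) is tried before plain decimal digits
    private static final Pattern INTEGER =
            Pattern.compile("0[xX][0-9a-fA-F]+|[0-9]+");

    // Returns the first integer lexeme in the input, or "" if there is none
    public static String firstInteger(String input) {
        Matcher m = INTEGER.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String lexeme = firstInteger("mask = 0x1F;");
        System.out.println(lexeme);                  // 0x1F
        System.out.println(Integer.decode(lexeme));  // 31
    }
}
```

Keeping the lexeme as a string and converting it afterwards, here with Integer.decode, separates recognising the number from evaluating it.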
