I am using a Lexical analyzer to tokenize some operators, conditions, and syntaxes. My approach is checking each and every character and when it finds a space between characters, it tokenizes the combined characters. eg. when it finds 'String' it tokenizes it as STR or ';' and tokenizes it as SEMI. so whenever it finds an operator like '++' it just tokenizes it as ADD_OP ADD_OP but I want it to tokenize it as one token i.e '++' should be INC or '>=' it prints out GT ASSIGN rather than GE. here is my code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
public class Lexer {
public static void Tokenize(String fileName) {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(fileName));
int r;
String token = "";
while ((r = reader.read()) != -1) {
char ch = (char) r;
if (Character.isWhitespace(ch)) {
if (!token.isBlank()) {
String[] tokens = tokenizeToken(token);
for (String t : tokens) {
if (!t.isBlank()) {
System.out.println(t);
}
}
}
token = "";
} else {
token += ch;
}
}
if (!token.isBlank()) {
String[] tokens = tokenizeToken(token);
for (String t : tokens) {
if (!t.isBlank()) {
System.out.println(t);
}
}
}
} catch (IOException e) {
System.err.println("Error reading file: " + e.getMessage());
} finally {
try {
if (reader != null) {
reader.close();
}
} catch (IOException e) {
System.err.println("Error closing file: " + e.getMessage());
}
}
}
private static String[] tokenizeToken(String token) {
String[] tokens = token.split("(?=[\\[\\](){}<>=,;+-/*%|&!])|(?<=[\\[\\](){}<>=,;+-/*%|&!])");
for (int i = 0; i < tokens.length; i++) {
String t = tokens[i].trim();
if (t.matches("procedure")) {
tokens[i] = "PROC";
} else if (t.matches("int")) {
tokens[i] = "INT";
} else if (t.matches("[0-9]+")) {
tokens[i] = "INT_CONST";
} else if (t.matches("end")) {
tokens[i] = "END";
} else if (t.matches("String") || t.matches("string")) {
tokens[i] = "STR";
} else if (t.matches("[(]")) {
tokens[i] = "LP";
} else if (t.matches("[)]")) {
tokens[i] = "RP";
} else if (t.matches("\".*\"")) {
tokens[i] = "STR_CONST";
} else if (t.matches("if")) {
tokens[i] = "IF";
} else if (t.matches("for")) {
tokens[i] = "FOR";
} else if (t.matches("while")) {
tokens[i] = "WHILE";
} else if (t.matches("return")) {
tokens[i] = "RETURN";
} else if (t.matches("[;]")) {
tokens[i] = "SEMI";
} else if (t.matches("do")) {
tokens[i] = "DO";
} else if (t.matches("break")) {
tokens[i] = "BREAK";
} else if (t.matches("[a-zA-Z][a-zA-Z0-9]*")) {
tokens[i] = "IDENT";
} else if (t.matches("[=]")) {
tokens[i] = "ASSIGN";
} else if (t.matches("[<]")) {
tokens[i] = "LT";
} else if (t.matches("[>]")) {
tokens[i] = "RT";
} else if (token.matches("[++]")) {
tokens[i] = "INC";
} else if (t.matches("[+]")) {
tokens[i] = "ADD_OP";
} else if (token.matches("[{]")) {
tokens[i] = "RB";
} else if (token.matches("[}]")) {
tokens[i] = "LB";
} else if (token.matches("[*]")) {
tokens[i] = "MUL_OP";
} else if (token.matches("[/]")) {
tokens[i] = "DIV_OP";
} else if (token.matches("[>=]")) {
tokens[i] = "GE";
} else {
System.out.println("SYSTEM ERROR: INVALID IDENTIFIER NAME");
}
}
return tokens;
}
}
and here is the output:
PROC
IDENT
LP
INT
IDENT
RP
FOR
LP
INT
IDENT
ASSIGN
INT_CONST
SEMI
IDENT
LT
IDENT
SEMI
IDENT
ASSIGN
IDENT
ADD_OP
ADD_OP
RP
RB
IDENT
ASSIGN
IDENT
MUL_OP
LP
IDENT
DIV_OP
INT_CONST
RP
SEMI
IF
LP
IDENT
RT
ASSIGN
INT_CONST
RP
BREAK
SEMI
LB
RETURN
IDENT
SEMI
END
IDENT
INT
SYSTEM ERROR: INVALID IDENTIFIER NAME
9user
ASSIGN
INT_CONST
SEMI
Also if any syntax or operator isn't listed should not print out in the output. e.g. the 9user shouldn't be printed out.