I am writing a simple shell in OCaml. My grammar looks a little like this at the moment
/*
program: command*
command: WORD WORD* redirection*
redirection: NUMBER? > FILENAME | < FILENAME
*/
Simple enough. I have a lexer file that looks like this:
let whitespace = [' ' '\t']+
let word = [^ '<' '>']+
let filename = [^ '\x00']+
let number = ['0' - '9']+
let newline = '\n' | "\r\n"
rule token = parse
| whitespace | newline { token lexbuf }
| number as lxm { NUMBER(int_of_string lxm) }
| word as lxm { WORD lxm }
| filename as lxm { FILENAME lxm }
| eof { EOF }
| _ as lxm { raise @@ SyntaxError("Unexpected char" ^ (String.make 1 lxm)) }
And a Menhir parser (likely not the problem but for completeness)
%{
open Ast
%}
%token <int> NUMBER
%token <string> WORD
%token <string> FILENAME
%token LEFTARROW
%token RIGHTARROW
%token EOF
%start program
%type <Ast.redirection> redirection
%type <Ast.command> command
%type <Ast.program> program
%%
redirection:
| LEFTARROW f = FILENAME { 0, f }
| opt = option(NUMBER) RIGHTARROW f = FILENAME {
match opt with
| None -> 1, f
| Some n -> n, f
}
command:
| executable = WORD args = list(WORD) redirections = list(redirection) { {executable; args; redirections} }
program:
| commands = list(command) EOF { commands }
I have defined a utility function parse_string
that looks like this
let parse_string program_string =
let lexbuf = Lexing.from_string program_string in
Parser.program Lexer.token lexbuf
For completeness, here is my AST
type redirection = int * string
type command = {
executable : string;
args : string list;
redirections : redirection list;
}
type program = command list
Now, given the input string "ls -l" I would expect the result to be {executable = "ls"; args = ["-l"]; redirections = []}
However I get a single executable string, "ls -l"
My lexer is not correctly tokenizing my input into separate strings. It is clumping my input into a single WORD definition.
So my question is, using ocamllex (with Menhir as a parser) how can I enforce strings to tokenize on whitespace?
EDIT:
Since I wanted to get something working quickly I followed the advice below and changed my word rule to [^ '<' '>' ' ' '\t']
to avoid whitespace. I also ran into a problem where a string "foo bar" was not being tokenized into 2 words. The problem was my filename rule; it turns out that ocamllex does not apply the rules in the order that the match specifies but always uses the longest matching rule. So "foo bar" was parsed into a single filename. I've removed the filname rule completely for now, the right thing to do is probably to only allow spaces in file names if they are quoted strings.