Need a simple Bison grammar for HTML

Question

I've looked at the Bison help and have written this, but I'm not sure it's completely correct. I also need a yylex() to handle lexical analysis (it should be generated with Flex). I know some basics of context-free grammars, but I don't know how to implement them correctly! :(

I want a simple Bison grammar for HTML. The question is: what should change in the following grammar?

%{
    #include <stdio.h>
    int yylex(void);
    int yyerror(char const *);
%}

%token NUM_TOKEN FILENAME_TOKEN COLOR_TOKEN NAME_TOKEN

/* HTML grammar follows... */
%%


/* Any HTML tag follows this pattern: */
EXPRESSION: 
            '<' TAG CLUSER '>' INNER_EXPRESSION "</" TAG '>' ;

/* Some html tags: */
TAG: 
     "a"    |
     "html" |
     "head" |
     "link" |
     "div"  |
     "input"|
     "from" |
     "title"|
     "img"  |
     "table"|
     "td"   |
     "tr"   ;


CLUSER:
       ALIGN|
       CLASS|
       ID|
       SRC|
       TEPY|
       ACTION|
       HREF|
       REL|
       /* empty (epsilon) */
       ;


ALIGN:
      "align" '=' "left"|
      "align" '=' "right"|
      "align" '=' "center"
      ;

CLASS:
      "class" '=' NAME_TOKEN
      ;

ID:
      "id" '=' NAME_TOKEN
      ;

SRC:
      "src" '=' FILENAME_TOKEN
      ;

TEPY:
      "type" '=' CONT
      ;

ACTION:
      "action" '=' FILENAME_TOKEN
      ;

HREF:
      "href" '=' '\"#\"'|
      "href" '=' FILENAME_TOKEN
      ;

REL:
      "rel" '=' "stylesheet"|
      "rel" '=' "\"stylesheet\""
      ;


DOMIN:
      "px"|
      "mm"|
      "cm"|
      "inch"
      ;

PAS:
     "php"|
     "asp"|
     "aspx"|
     "css"
     ;

CONT:
     "button"|
     "checkbox"|
     "text"|
     "password"|
     "file"|
     "submit"
     ;

INNER_EXPRESSION:
     EXPRESSION|
     /* empty (epsilon) */
     ;


/* HTML grammar ends. */
%%

This is Bison's output:

E:\Program Files\GnuWin32\bin>bison "E:\Dev-Cpp\HtmlBison\html.y" -o "E:\html.c"

E:\Dev-Cpp\HtmlBison\html.y: warning: 2 nonterminals useless in grammar
E:\Dev-Cpp\HtmlBison\html.y: warning: 8 rules useless in grammar
E:\\Dev-Cpp\\HtmlBison\\html.y:83.1-5: warning: nonterminal useless in grammar: DOMIN
E:\\Dev-Cpp\\HtmlBison\\html.y:90.1-3: warning: nonterminal useless in grammar: PAS
E:\\Dev-Cpp\\HtmlBison\\html.y:84.7-10: warning: rule useless in grammar: DOMIN: "px"
E:\\Dev-Cpp\\HtmlBison\\html.y:85.7-10: warning: rule useless in grammar: DOMIN: "mm"
E:\\Dev-Cpp\\HtmlBison\\html.y:86.7-10: warning: rule useless in grammar: DOMIN: "cm"
E:\\Dev-Cpp\\HtmlBison\\html.y:87.7-12: warning: rule useless in grammar: DOMIN: "inch"
E:\\Dev-Cpp\\HtmlBison\\html.y:91.6-10: warning: rule useless in grammar: PAS: "php"
E:\\Dev-Cpp\\HtmlBison\\html.y:92.6-10: warning: rule useless in grammar: PAS: "asp"
E:\\Dev-Cpp\\HtmlBison\\html.y:93.6-11: warning: rule useless in grammar: PAS: "aspx"
E:\\Dev-Cpp\\HtmlBison\\html.y:94.6-10: warning: rule useless in grammar: PAS: "css"
m4: cannot open `Files\GnuWin32/share/bison': No such file or directory
m4: cannot open `E:\Program': No such file or directory
m4: cannot open `Files\GnuWin32/share/bison/m4sugar/m4sugar.m4': No such file or directory

It's not going to be a complete HTML parser. I just want to validate very simple HTML documents without any CSS styles, JavaScript, and so on. I also saw this. NOTE: The solution must be a Bison grammar!

Oh, and get rid of dev-cpp. See http://www.jasonbadams.net/20081218/why-you-shouldnt-use-dev-c/ — ThiefMaster, Jan 22 '11 at 20:50

score 4 · Accepted Answer · answered Jan 22 '11 at 20:58

TAG should be a token that is returned from a lexer, else you will be writing cases till the cows come home.

Same goes for attributes, etc.
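
To illustrate the point, here is a minimal sketch in C of a scanner that hands every tag or attribute name to the parser as one generic token, so the grammar does not need a rule per keyword. It is a hand-written stand-in for what a Flex-generated yylex() would do, not a finished scanner. The function next_token, the buffer name_text, and the token codes are assumptions made for this example; in a real build Bison would generate the codes from the %token declarations. Note that Bison accepts quoted strings such as "html" in rules but does not generate a scanner for them, so the lexer still has to return a code for each one.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; Bison would normally generate these from %token. */
enum { END_TOKEN = 0, NAME_TOKEN = 258, CLOSE_OPEN_TOKEN = 259 };

/* Text of the most recent NAME_TOKEN (a hypothetical stand-in for yylval). */
static char name_text[64];

/* Scan one token starting at *cursor and advance the cursor past it. */
int next_token(const char **cursor)
{
    const char *p = *cursor;
    assert(p != NULL);

    while (isspace((unsigned char)*p))      /* skip whitespace */
        p++;

    if (*p == '\0') {                       /* end of input */
        *cursor = p;
        return END_TOKEN;
    }

    if (p[0] == '<' && p[1] == '/') {       /* "</" starts a closing tag */
        *cursor = p + 2;
        return CLOSE_OPEN_TOKEN;
    }

    if (isalpha((unsigned char)*p)) {       /* any tag or attribute name */
        size_t n = 0;
        while (isalnum((unsigned char)*p)) {
            if (n < sizeof name_text - 1)   /* truncate overly long names */
                name_text[n++] = *p;
            p++;
        }
        name_text[n] = '\0';
        *cursor = p;
        return NAME_TOKEN;
    }

    *cursor = p + 1;                        /* '<', '>', '=' and the rest */
    return (unsigned char)*p;               /* returned as their own code */
}
```

With a scanner of this shape, the EXPRESSION rule could refer to NAME_TOKEN where it currently refers to TAG, and the check that the opening and closing names match would move into a parser action.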

answered Jan 22 '11 at 20:58

leppie


I don't think "TAG" should be a token; it's a parser-level construct in my opinion. I guess it depends on what you mean by "TAG"; is it just the tag name? In that case, yes the lexer should just worry about the tag identifier as a token, and the parser should be worrying about the collection of tags it's willing to recognize. – Pointy Jan 22 '11 at 21:04
@Pointy: That's what I mean. Would be better to just call it `IDENTIFIER` or such. – leppie Jan 22 '11 at 21:06
Although, now that I think of it, a limited XML parser generator I once worked on would preload the table of identifiers into a hash. The lexer would recognize an "identifier" but would then perform the hash lookup as a convenience. It could then give the parser an integer code for the tag name (or something like -1 for an unknown name), which made the parser a lot faster. Of course, the parser could perform that lookup too, I guess. – Pointy Jan 22 '11 at 21:09
@Pointy: I agree. The question remains, though, whether Bison actually supports 'string literals'. I know many integrated lexer/parser generators like ANTLR do. – leppie Jan 22 '11 at 21:12
Thank you, leppie and Pointy, for the replies. Do you have any idea about Bison's output? – Jalal Jan 22 '11 at 22:10
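
The lookup Pointy describes in the comments above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: it uses a sorted table with the standard bsearch() instead of a hash, the names known_tags, compare_tag and tag_code are made up for the example, and the tag list is loosely based on the question's TAG rule (with "form" assumed where the question has "from"). The comparison is case-sensitive, which real HTML tag names are not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Tag names the parser is willing to recognize, kept sorted for bsearch(). */
static const char *const known_tags[] = {
    "a", "div", "form", "head", "html", "img",
    "input", "link", "table", "td", "title", "tr"
};

/* bsearch() comparator: key is a C string, elem points at a table entry. */
static int compare_tag(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Return the table index of a known tag name, or -1 for an unknown name. */
int tag_code(const char *name)
{
    assert(name != NULL);

    size_t count = sizeof known_tags / sizeof known_tags[0];
    const char *const *hit = bsearch(name, known_tags, count,
                                     sizeof known_tags[0], compare_tag);

    return hit ? (int)(hit - known_tags) : -1;
}
```

The lexer could call tag_code() on each identifier it scans and pass the result to the parser as the token's semantic value, so the grammar itself stays independent of the list of tag names.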
