I'm looking for a BibTeX grammar in ANTLR to use in a hobby project. I'd rather not spend time writing the ANTLR grammar myself (it would involve a learning curve for me), so I'd appreciate any pointers.

Note: I've found BibTeX grammars for bison and yacc but couldn't find any for ANTLR.

Edit: As Bart pointed out, I don't need to parse the preambles or the TeX inside quoted strings.

systemsfault
  • How accurate/precise do you need it to be? Preambles can be a mess to parse and inside quoted- or braced content, you can officially embed "math"-like code like this: `"text $2 \times \pi$ text"`, AFAIK. Do you want to parse all that as well, or can preambles and quoted (or braced) things be tokenized as a single token? – Bart Kiers Sep 28 '11 at 22:00
  • Hi Bart, thanks for the pointer. I don't need to parse the TeX in BibTeX, and the quoted (or braced, or preamble) text can be tokenized as a single token as well. – systemsfault Sep 29 '11 at 06:40
  • then I might have just what you're after (a not-so-precise BibTex grammar, that is). Let me dust it off for you and write a small test class. – Bart Kiers Sep 29 '11 at 12:57
  • [jBibTeX](https://github.com/jbibtex/jbibtex) is a complete library for handling BibTeX files. There should be no need for a separate grammar, is there any? – koppor Aug 24 '15 at 19:41

1 Answer

Here's a (very) rudimentary BibTex grammar that emits an AST (as opposed to a plain parse tree):

grammar BibTex;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  BIBTEXFILE;
  TYPE;
  STRING;
  PREAMBLE;
  COMMENT;
  TAG;
  CONCAT;
}

//////////////////////////////// Parser rules ////////////////////////////////
parse
  :  (entry (Comma? entry)* Comma?)? EOF             -> ^(BIBTEXFILE entry*)
  ;

entry
  :  Type Name Comma tags CloseBrace                 -> ^(TYPE Name tags)
  |  StringType Name Assign QuotedContent CloseBrace -> ^(STRING Name QuotedContent)
  |  PreambleType content CloseBrace                 -> ^(PREAMBLE content)
  |  CommentType                                     -> ^(COMMENT CommentType)
  ;

tags
  :  (tag (Comma tag)* Comma?)?                      -> tag*
  ;

tag
  :  Name Assign content                             -> ^(TAG Name content)
  ;

content
  :  concatable (Concat concatable)*                 -> ^(CONCAT concatable+)
  |  Number
  |  BracedContent
  ;

concatable
  :  QuotedContent
  |  Name
  ;

//////////////////////////////// Lexer rules ////////////////////////////////
Assign
  :  '='
  ;

Concat
  :  '#'
  ;

Comma
  :  ','
  ;

CloseBrace
  :  '}'
  ;

QuotedContent
  :  '"' (~('\\' | '{' | '}' | '"') | '\\' . | BracedContent)* '"'
  ;

BracedContent
  :  '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
  ;

StringType
  :  '@' ('s'|'S') ('t'|'T') ('r'|'R') ('i'|'I') ('n'|'N') ('g'|'G') SP? '{'
  ;

PreambleType
  :  '@' ('p'|'P') ('r'|'R') ('e'|'E') ('a'|'A') ('m'|'M') ('b'|'B') ('l'|'L') ('e'|'E') SP? '{'
  ;

CommentType
  :  '@' ('c'|'C') ('o'|'O') ('m'|'M') ('m'|'M') ('e'|'E') ('n'|'N') ('t'|'T') SP? BracedContent
  |  '%' ~('\r' | '\n')*
  ;

Type
  :  '@' Letter+ SP? '{'
  ;

Number
  :  Digit+
  ;

Name
  :  Letter (Letter | Digit | ':' | '-')*
  ;

Spaces
  :  SP {skip();}
  ;

//////////////////////////////// Lexer fragments ////////////////////////////////
fragment Letter
  :  'a'..'z'
  |  'A'..'Z'
  ;

fragment Digit
  :  '0'..'9'
  ;

fragment SP
  :  (' ' | '\t' | '\r' | '\n' | '\f')+
  ;  

(if you don't want the AST, remove all -> and everything to the right of it and remove both the options{...} and tokens{...} blocks)

which can be tested with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {

    // parse the file 'test.bib'
    BibTexLexer lexer = new BibTexLexer(new ANTLRFileStream("test.bib"));
    BibTexParser parser = new BibTexParser(new CommonTokenStream(lexer));

    // you can use the following tree in your code
    // see: http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1tree_1_1_common_tree.html
    CommonTree tree = (CommonTree)parser.parse().getTree();

    // print a DOT tree of our AST
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

and the following example Bib-input (file: test.bib):

@PREAMBLE{
  "\newcommand{\noopsort}[1]{} "
  # "\newcommand{\singleletter}[1]{#1} " 
}

@string { 
  me = "Bart Kiers" 
}

@ComMENt{some comments here}

% or some comments here

@article{mrx05,
  auTHor = me # "Mr. X",
  Title = {Something Great}, 
  publisher = "nob" # "ody",
  YEAR = 2005,
  x = {{Bib}\TeX},
  y = "{Bib}\TeX",
  z = "{Bib}" # "\TeX",
},

@misc{ patashnik-bibtexing,
       author = "Oren Patashnik",
       title = "BIBTEXing",
       year = "1988"
} % no comma here

@techreport{presstudy2002,
    author      = "Dr. Diessen, van R. J. and Drs. Steenbergen, J. F.",
    title       = "Long {T}erm {P}reservation {S}tudy of the {DNEP} {P}roject",
    institution = "IBM, National Library of the Netherlands",
    year        = "2002",
    month       = "December",
}

Run the demo

If you now generate a parser & lexer from the grammar:

java -cp antlr-3.3.jar org.antlr.Tool BibTex.g

and compile all .java source files:

javac -cp antlr-3.3.jar *.java

and finally run the Main class:

*nix/MacOS

java -cp .:antlr-3.3.jar Main

Windows

java -cp .;antlr-3.3.jar Main

You'll see some output on your console which corresponds to the following AST:

[Image: the AST for test.bib, generated with graphviz-dev.appspot.com]
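In that AST, the children of a `CONCAT` node are either quoted strings or names that refer to `@string` definitions. As a rough sketch of how such a node could be flattened into one value (this is not part of the grammar or the ANTLR runtime; `ConcatResolver` and `resolve` are hypothetical names, and it assumes the children's token texts have already been collected into a list, with quoted tokens still carrying their surrounding `"` characters and macro names treated as case-insensitive):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: joins the children of a CONCAT node into one string.
final class ConcatResolver {

    /**
     * parts  : token texts of the CONCAT children, in order
     * macros : @string definitions, keyed by lower-cased name (an assumption)
     */
    static String resolve(List<String> parts, Map<String, String> macros) {
        StringBuilder out = new StringBuilder();
        for (String part : parts) {
            boolean quoted = part.length() >= 2
                    && part.startsWith("\"") && part.endsWith("\"");
            if (quoted) {
                // QuotedContent token: drop the surrounding quotes
                out.append(part, 1, part.length() - 1);
            } else {
                // Name token: look it up among the @string definitions
                String value = macros.get(part.toLowerCase(Locale.ROOT));
                if (value == null) {
                    throw new IllegalArgumentException("undefined @string: " + part);
                }
                out.append(value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("me", "Bart Kiers");            // from: @string { me = "Bart Kiers" }
        // from: auTHor = me # "Mr. X"
        System.out.println(resolve(List.of("me", "\"Mr. X\""), macros));
    }
}
```

For the `auTHor` tag of the example input this yields `Bart KiersMr. X`, since `#` joins its operands without inserting a space.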

To emphasize: I did not properly test the grammar! I wrote it a while back and never really used it in any project.
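Given that caveat, one cheap way to sanity-check the lexer is to compare its tokens against a plain-Java reimplementation of a single rule. The sketch below mirrors what the `BracedContent` rule accepts (nested braces, with a backslash escaping the next character). It is only an illustration, not something generated by ANTLR; `BraceScanner` and `scanBraced` are made-up names.

```java
// Hypothetical helper, not part of the grammar: mirrors the BracedContent rule
//   '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
final class BraceScanner {

    /**
     * Returns the index just past the '}' that closes the '{' at position
     * start, or -1 if the input ends before the group is closed.
     */
    static int scanBraced(String input, int start) {
        if (start < 0 || start >= input.length() || input.charAt(start) != '{') {
            throw new IllegalArgumentException("expected '{' at index " + start);
        }
        int depth = 0;
        for (int i = start; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\') {
                i++;                    // '\\' . : skip the escaped character
            } else if (c == '{') {
                depth++;                // entering a (nested) group
            } else if (c == '}') {
                depth--;                // leaving a group
                if (depth == 0) {
                    return i + 1;       // outermost group is closed
                }
            }
        }
        return -1;                      // ran out of input: unbalanced
    }

    public static void main(String[] args) {
        String field = "{{Bib}\\TeX}, y = 1";   // from: x = {{Bib}\TeX},
        int end = scanBraced(field, 0);
        System.out.println(field.substring(0, end));
    }
}
```

Running `main` prints `{{Bib}\TeX}`, which is the text a single `BracedContent` token should cover for the `x` tag of the example input.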

Bart Kiers
  • That looks like a great starting point. I'm on the hunt for Java BibTex parsers which can be included in an Apache-licensed project. I suspect that you've intended to donate this grammar to the public domain, but just for the record, can you clarify what license you are releasing this under? – Tom Morris Oct 15 '11 at 22:27
  • @Tom, AFAIK, every user contribution that is posted on SE sites is licensed under [Creative-Commons](http://creativecommons.org/licenses/by-sa/3.0/). – Bart Kiers Oct 16 '11 at 05:15
  • Thanks for the quick reply, Bart. I'd forgotten that the Terms of Service CC-BY-SA 3.0 would apply. Now I need to figure out if that's compatible with the BSD license (I misspoke when I said Apache before). I suspect it probably isn't. – Tom Morris Oct 18 '11 at 20:31
  • @Tom, no problem. Let's pretend you just e-mailed me, I changed something in the grammar, and e-mailed it back to you. You have my (written) permission to add whatever open-source-like license you'd like to it. If you want a more formal written consent from me, drop me a line (my e-mail is in my profile) and I'll e-mail you the grammar with my permission to license it the way it best suits you. – Bart Kiers Oct 18 '11 at 20:44