17

A couple of days ago, I read a blog entry (http://ayende.com/Blog/archive/2008/09/08/Implementing-generic-natural-language-DSL.aspx) where the author discusses the idea of a generic natural language DSL parser using .NET.

The brilliant part of his idea, in my opinion, is that the text is parsed and matched against classes whose names match the words of the sentences.

Take the following lines as an example:

Create user user1 with email test@email.com and password test
Log user1 in
Take user1 to category t-shirts
Make user1 add item Flower T-Shirt to cart
Take user1 to checkout

Each line would get converted using a collection of "known" objects that receive the result of parsing. Some example objects would be (using Java for my example):

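public class CreateUser {
    private final String user;
    private String email;
    private String password;

    // The word after "Create user" is passed to the constructor.
    public CreateUser(String user) {
        this.user = user;
    }

    // Matches the "with email" part of the sentence.
    public void withEmail(String email) {
        this.email = email;
    }

    // Matches the "and password" part of the sentence.
    public void andPassword(String password) {
        this.password = password;
    }
}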

So, when processing the first sentence, the CreateUser class would be a match (because its name is the concatenation of "create user") and, since its constructor takes a parameter, the parser would take "user1" as the user parameter.

After that, the parser would identify that the next part, "with email" also matches a method name, and since that method takes a parameter, it would parse "test@email.com" as being the email parameter.
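To make the matching steps above concrete, here is a minimal sketch using plain Java reflection. It is only one possible reading of the idea, not the blog author's implementation. The names SentenceParser, register and parse are made up for this illustration, and the sketch assumes every parameter is a single word of type String (so a value such as "Flower T-Shirt" would need quoting or some other rule).

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps sentences onto registered "known" classes via reflection.
class SentenceParser {

    // Example command, mirroring the CreateUser class from the question.
    public static class CreateUser {
        public final String user;
        public String email;
        public String password;
        public CreateUser(String user) { this.user = user; }
        public void withEmail(String email) { this.email = email; }
        public void andPassword(String password) { this.password = password; }
    }

    // Lower-cased simple class name -> class, e.g. "createuser" -> CreateUser.
    private final Map<String, Class<?>> commands = new HashMap<>();

    public void register(Class<?> type) {
        commands.put(type.getSimpleName().toLowerCase(Locale.ROOT), type);
    }

    public Object parse(String sentence) {
        List<String> words = Arrays.asList(sentence.trim().split("\\s+"));
        try {
            // Find the shortest leading run of words that names a registered class.
            for (int n = 1; n < words.size(); n++) {
                String key = String.join("", words.subList(0, n)).toLowerCase(Locale.ROOT);
                Class<?> type = commands.get(key);
                if (type != null) {
                    // The next word becomes the constructor argument.
                    Object command = type.getConstructor(String.class).newInstance(words.get(n));
                    applyMethods(command, words.subList(n + 1, words.size()));
                    return command;
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not build command for: " + sentence, e);
        }
        throw new IllegalArgumentException("No command matches: " + sentence);
    }

    // Consumes "<method words> <argument>" groups, e.g. "with email test@email.com".
    private static void applyMethods(Object command, List<String> rest)
            throws ReflectiveOperationException {
        int i = 0;
        while (i < rest.size()) {
            StringBuilder name = new StringBuilder();
            Method method = null;
            int end = i;
            // Grow the candidate name word by word until a method matches,
            // always leaving at least one word for the argument.
            for (int j = i; j < rest.size() - 1 && method == null; j++) {
                name.append(rest.get(j));
                method = findMethod(command.getClass(), name.toString());
                end = j;
            }
            if (method == null) {
                throw new IllegalArgumentException("No method matches: " + rest.subList(i, rest.size()));
            }
            method.invoke(command, rest.get(end + 1));
            i = end + 2;
        }
    }

    // Case-insensitive lookup of a public method taking a single String.
    private static Method findMethod(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equalsIgnoreCase(name)
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0] == String.class) {
                return m;
            }
        }
        return null;
    }
}
```

With CreateUser registered, parsing the first sentence of the example script builds a CreateUser whose user, email and password fields hold "user1", "test@email.com" and "test". Matching on the shortest run of words keeps the sketch simple, but a real parser would need a rule for class names that are prefixes of one another.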

I think you get the idea by now, right? One quite clear application of that, at least for me, would be to allow application testers to create "testing scripts" in natural language and then parse the sentences into classes that use JUnit to check app behavior.

I'd like to hear ideas, tips and opinions on tools or resources that could help me code such a parser in Java. Better yet if we could avoid complex lexers or frameworks like ANTLR, which I think might be using a hammer to kill a fly.

More than that, if anyone is up to start an open source project for that, I would definitely be interested.

Fabian Steeg
kolrie
  • Similar to Glurk's answer, therefore as comment: If you look for executable "natural" language specs, you should give Cucumber (http://cukes.info/) a try. Together with JRuby (and RSpec), you can use it for Java-based BDD (http://behaviour-driven.org/). Alternatives include EasyB and JBehave. – Nils Wloka Mar 25 '09 at 11:00
  • What is DSL? is it Disambiguation of Similar Languages? see corporavm.uni-koeln.de/vardial/sharedtask.html – alvas Mar 23 '14 at 20:52

6 Answers

27

Considering the complexity of lexing and parsing, I don't know if I'd want to code all that by hand. ANTLR isn't that hard to pick up, and I think it is worth looking into for your problem. If you use a parser grammar to build an abstract syntax tree (AST) from the input, it's pretty easy to then process that AST with a tree grammar. The tree grammar could easily handle executing the process you described.

You'll find ANTLR in many places including Eclipse, Groovy, and Grails for a start. The Definitive ANTLR Reference even makes it fairly straightforward to get up to speed on the basics fairly quickly.

I had a project that had to handle some user-generated query text earlier this year. I started down a path to manually process it, but it quickly became overwhelming. I took a couple of days to get up to speed on ANTLR and had an initial version of my grammar and processor running in a few days. Subsequent changes and adjustments to the requirements would have killed any custom version, but required relatively little effort once I had the ANTLR grammars up and running.

Good luck!

Joe Skora
  • Joe, thanks. I added that book to my cart on Amazon a couple of times. Do you think it would be easy to create dynamic grammar trees based on the registered parsers? The library would have to use reflection to extract class name, methods, (...) and create the grammar tree for ANTLR, right? – kolrie Sep 27 '08 at 20:27
  • You can insert Java (or another language, since ANTLR can generate a variety of languages) directly into the grammar. I used one grammar to parse my DSL and a second to walk the AST, processing the nodes. Since all of this runs in your app, it can easily create objects and call methods. – Joe Skora Sep 27 '08 at 20:45
  • It took a couple of days to get my head wrapped around ANTLR, having never taken a lexer/parser/compiler course. I am very glad I did it as it will be useful again and again in the future. Parr wrote ANTLR, so the book is a great resource and a well-written introduction to lexing and parsing too. – Joe Skora Sep 27 '08 at 20:48
  • Joe, your feedback was excellent, I will definitely buy the book. – kolrie Sep 27 '08 at 22:55
  • If you go to the Pragmatic Bookshelf site (http://pragprog.com/titles/tpantlr/the-definitive-antlr-reference) you can get the book and a PDF copy for 45.75 + shipping. Good luck. You won't regret picking up a new tool for your skill set. – Joe Skora Sep 27 '08 at 22:58
  • I'll buy it for sure. I am in the same situation you were in: I don't have any experience with the lexer/parser/compiler trio either, and that has always been something on my future learning list. Now is the time! Thank you again for your valuable feedback. – kolrie Sep 27 '08 at 23:38
10

You might want to consider Xtext, which internally uses ANTLR and does some nice things like auto-generating an editor for your DSL.

Fabian Steeg
10

If you call that "natural language", you're deluding yourself. It's still a programming language, just one that tries to mimic natural language - and I suspect that it will fail once you get into implementation details. In order to make it unambiguous, you'll have to put restrictions on the syntax that will confuse the users who've been led to think that they're writing "English".

The advantage of a DSL is (or should be, at any rate) that it's simple and clear, yet powerful in regard to the problem domain. Mimicking a natural language is a secondary concern, and may in fact be counter-productive to those primary goals.

If someone is too stupid or lacks the ability for formally rigorous thinking that's required for programming, then a programming language that mimics a natural one will NOT magically turn them into a programmer.

When COBOL was invented, some people seriously believed that within 10 years there would be zero demand for professional programmers, since COBOL was "like English", and anyone who needed software could write it himself. And we all know how that's been working out.

Michael Borgwardt
  • +1, yeah it's killing me that companies are actually abandoning COBOL. Why hire expensive programmers when there are plenty of cheap English-speaking folk out there? – Jonas Byström Sep 05 '12 at 14:40
4

The first time I heard of DSLs was from JetBrains, the creator of IntelliJ IDEA.

They have this tool: MPS (Meta Programming System)

cod3monk3y
OscarRyz
1

You might find this multi-part blog series I did on using Antlr to be useful as a starting point. It uses Antlr 2, so some stuff will be different for Antlr 3:

http://tech.puredanger.com/2007/01/13/implementing-a-scripting-language-with-antlr-part-1-lexer/

Mark Volkmann's presentations/articles on Antlr are quite helpful as well:

http://www.ociweb.com/mark/programming/ANTLR3.html

I will second the suggestion about the Definitive ANTLR book, which is also excellent.

Alex Miller
0

"One quite clear application of that, at least for me, would be to allow application testers create "testing scripts" in natural language and then parse the sentences into classes that uses JUnit to check for app behaviors"

What you are talking about here sounds exactly like the tool FitNesse. Exactly as you describe, clients write acceptance test "scripts" in some kind of language that makes sense to them, and programmers build systems that make the tests pass. Even the implementation you talk about is pretty much exactly how FitNesse works: the words used in the scripts are concatenated to form function names etc., so that the FitNesse framework knows what function to call.
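As a rough illustration of that name concatenation, the sketch below joins the words of a script row into a camel-case method name and calls it through reflection. It does not use FitNesse itself and is not its API. InterleavedCall, LoginFixture and the alternating "name fragment, argument" row layout are assumptions made up for this example.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: turns a row of cells into a method call on a fixture object.
class InterleavedCall {

    // Hypothetical fixture that a programmer would write to back the test script.
    public static class LoginFixture {
        public boolean loginWithUsernameAndPassword(String username, String password) {
            return "bob".equals(username) && "secret".equals(password);
        }
    }

    // Cells at even positions are name fragments, cells at odd positions are arguments.
    static Object call(Object fixture, String... cells) {
        StringBuilder name = new StringBuilder();
        List<String> args = new ArrayList<>();
        for (int i = 0; i < cells.length; i++) {
            if (i % 2 == 1) {
                args.add(cells[i]);
                continue;
            }
            for (String word : cells[i].trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // The first word stays lower-case, later words are capitalized.
                name.append(name.length() == 0 ? word.toLowerCase(Locale.ROOT) : capitalize(word));
            }
        }
        // This sketch assumes every parameter is a String.
        Class<?>[] types = new Class<?>[args.size()];
        Arrays.fill(types, String.class);
        try {
            Method method = fixture.getClass().getMethod(name.toString(), types);
            return method.invoke(fixture, args.toArray());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + name + " on " + fixture.getClass(), e);
        }
    }

    private static String capitalize(String word) {
        return word.substring(0, 1).toUpperCase(Locale.ROOT)
                + word.substring(1).toLowerCase(Locale.ROOT);
    }
}
```

Calling `InterleavedCall.call(new InterleavedCall.LoginFixture(), "login with username", "bob", "and password", "secret")` resolves to `loginWithUsernameAndPassword("bob", "secret")` on the fixture.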

Anyway, check it out :)