Why are ML/Haskell datatypes useful for defining "languages" like arithmetic expressions?

Question

This is more of a soft question about static type systems in functional languages like those of the ML family. I understand why you need datatypes to describe data structures like lists and trees but defining "expressions" like those of propositional logic within datatypes seems to bring just some convenience and is not necessary. For example

datatype arithmetic_exp = Constant of int
                        | Neg of arithmetic_exp
                        | Add of (arithmetic_exp * arithmetic_exp)
                        | Mult of (arithmetic_exp * arithmetic_exp)

defines a set of values, on which you can write an eval function which would give you the result. You could just as well define 4 functions: const: int -> int, neg: int -> int, add: int * int -> int and mult: int * int -> int and then an expression of the sort add (mult (const 3, neg 2), neg 4) would give you the same thing without any loss of static security. The only complication is that you have to do four things instead of two. While learning SML and Haskell I've been trying to think about which features give you something necessary and which are just a convenience, so this is the reason why I'm asking. I guess this would matter if you want to decouple the process of evaluating a value from the value itself but I'm not sure where that would be useful.

Thanks a lot.

Are you asking what the purpose of ASTs is? – sepp2k Sep 05 '17 at 10:28 — sepp2k, Sep 05 '17 at 10:28

score 7 · Accepted Answer · edited Sep 05 '17 at 11:33

There is a duality between initial / first-order / datatype-based encodings (aka deep embeddings) and final / higher-order / evaluator-based encodings (aka shallow embeddings). You can indeed typically use a typeclass of combinators instead of a datatype (and convert back and forth between the two).

Here is a module showing the two approaches:

{-# LANGUAGE GADTs, Rank2Types #-}

module Expr where

data Expr where
  Val :: Int -> Expr
  Add :: Expr -> Expr -> Expr

class Expr' a where
  val :: Int -> a
  add :: a -> a -> a

You can see that the two definitions look eerily similar. Expr' a is basically describing an algebra on Expr which means that you can get an a out of an Expr if you have such an Expr' a. Similarly, because you can write an instance Expr' Expr, you're able to reify a term of type forall a. Expr' a => a into a syntactic value of type Expr:

expr :: Expr' a => Expr -> a
expr e = case e of
  Val n   -> val n
  Add p q -> add (expr p) (expr q)

instance Expr' Expr where
  val = Val
  add = Add

expr' :: (forall a. Expr' a => a) -> Expr
expr' e = e

In the end, picking a representation over another really depends on what your main focus is: if you want to inspect the structure of the expression (e.g. if you want to optimise / compile it), it's easier if you have access to an AST. If, on the other hand, you're only interested in computing an invariant using a fold (e.g. the depth of the expression or its evaluation), a higher order encoding will do.

Thanks @chi. I didn't know about these tags to get Haskell highlighting! — gallais, Sep 05 '17 at 11:55
To this point, [Oleg Kiselyov's lecture notes on Typed Tagless Final Interpreters](http://okmij.org/ftp/tagless-final/course/) are worth reading. They explain the initial/final duality in detail, demonstrate the unexpected flexibility of the final representation in the context of a polymorphic type system, and show some of the incredible ways a "final interpreter" can be extended to handle new language constructs without modifying any existing code. When I first read these notes, they blew my mind. — K. A. Buhr, Sep 06 '17 at 22:15

amalloy · Answer 2 · 2017-09-06T19:43:02.380

The ADT is in a form you can inspect and manipulate in ways other than simply evaluating it. Once you hide all the interesting data in a function call, there is no longer anything you can do with it but evaluate it. Consider this definition, similar to the one in your question, but with a Var term to represent variables and with the Mul and Neg terms removed to focus on addition.

data Expr a = Constant a
            | Add (Expr a) (Expr a)
            | Var String
            deriving Show

The obvious function to write is eval, of course. It requires a way to look up a variable's value, and is straightforward:

-- cheating a little bit by assuming all Vars are defined
eval :: Num a => Expr a -> (String -> a) -> a
eval (Constant x) _env = x
eval (Add x y) env = eval x env + eval y env
eval (Var x) env = env x

But suppose you don't have a variable mapping yet. You have a large expression that you will evaluate many times, for different choices of variable. And some silly recursive function built up an expression like:

Add (Constant 1) 
    (Add (Constant 1) 
         (Add (Constant 1) 
              (Add (Constant 1) 
                   (Add (Constant 1) 
                        (Add (Constant 1) 
                             (Var "x"))))))

It would be wasteful to recompute 1+1+1+1+1+1 every time you evaluate this: wouldn't it be nice if your evaluator could realize that this is just another way of writing Add (Constant 6) (Var "x")?

So, you write an expression optimizer, which runs before any variables are available and tries to simplify expressions. There are many simplification rules you could apply, of course; below I've implemented just two very easy ones to illustrate the point.

simplify :: Num a => Expr a -> Expr a
simplify (Add (Constant x) (Constant y)) = Constant $ x + y
simplify (Add (Constant x) (Add (Constant y) z)) = simplify $ Add (Constant $ x + y) z
simplify x = x

Now how does our silly expression look?

> simplify $ Add (Constant 1) (Add (Constant 1) (Add (Constant 1) (Add (Constant 1) (Add (Constant 1) (Add (Constant 1) (Var "x"))))))
Add (Constant 6) (Var "x")

All the unnecessary stuff has been removed, and you now have a nice clean expression to try for various values of x.

How do you do the same thing with a representation of this expression in functions? You can't, because there is no "intermediate form" between the initial specification of the expression and its final evaluation: you can only look at the expression as a single, opaque function call. Evaluating it at a particular value of x necessarily evaluates each of the subexpressions anew, and there is no way to disentangle them.

Here's an extension of the functional type you propose in your question, again enriched with variables:

type FExpr a = (String -> a) -> a

lit :: a -> FExpr a
lit x _env = x

add :: Num a => FExpr a -> FExpr a -> FExpr a
add x y env = x env + y env

var :: String -> FExpr a
var x env = env x

with the same silly expression to evaluate many times:

sample :: Num a => FExpr a
sample = add (lit 1)
             (add (lit 1)
                  (add (lit 1)
                       (add (lit 1)
                            (add (lit 1)
                                 (add (lit 1)
                                      (var "x"))))))

It works as expected:

> sample $ \_var -> 5
11

But it has to do a bunch of addition every time you try it for a different x, even though the addition and the variable are mostly unrelated. And there's no way you can simplify the expression tree. You can't simplify it while defining it: that is, you can't make add smarter, because it can't inspect its arguments at all: its arguments are functions which, as far as add is concerned, could do anything at all. And you can't simplify it after constructing it, either: at that point you just have an opaque function that takes in a variable lookup function and produces a value.

By modeling the important parts of your problem as data types in their own right, you make them values that your program can manipulate intelligently. If you leave them as functions, you get a shorter program that is less powerful, because you lock all the information inside lambdas that only GHC can manipulate.

And once you've written it with ADTs, it's not hard to collapse that representation back into the shorter function-based representation if you want. That is, it might be nice to have a function of type

convert :: Expr a -> FExpr a

But in fact, we've already done this! That's exactly the type that eval has. You just might not have noticed because of the FExpr type alias, which is not used in the definition of eval.

So in a way, the ADT representation is more general and more powerful, acting as a tree that you can fold up in many different ways. One of those ways is evaluating it, in exactly the way that the function-based representation does. But there are others:

Simplify the expression before evaluating it
Produce a list of all the variables that must be defined for this expression to be well formed
Count how deeply nested the deepest part of the tree is, to estimate how many stack frames an evaluator might need
Convert the expression to a String approximating a Haskell expression you could type to get the same result

So if possible you'd like to work with information-rich ADTs for as long as you can, and then eventually fold the tree up into a more compact form once you have something specific to do with it.

Why are ML/Haskell datatypes useful for defining "languages" like arithmetic expressions?

2 Answers2