
I am writing a compiler for a DSL. After the source file is read into a string, all the remaining steps (parsing, type checking, and codegen) are pure code that transforms the program from one representation to another. This works well until the source file has dependencies (think of the #include preprocessor directive in C). The parser then needs to read the dependent files and parse them recursively, so it is no longer pure: I have to change it from returning AST to returning IO AST. All the subsequent steps (type checking and codegen) then have to return IO types as well, which requires significant changes. What is a good way to handle reading dependent files in this case?

P.S. I could use unsafePerformIO, but that seems like a hack that would lead to technical debt.

sinoTrinity
  • Can you conceptually split this into two phases? The first phase recognizes and expands the inclusions; the second parses the resulting "complete" DSL code. This is how C works: the C preprocessor expands all its directives before the C parser takes over. – chepner Sep 15 '20 at 17:36
  • You can change the type to `m AST` instead of `IO AST`. This prevents any unrelated IO from sneaking in, and you can pass in a `readExternal :: FilePath -> IO String` to get an IO version for prod, or `readMocked :: FilePath -> Identity String` to get a pure version for testing or embedding. – that other guy Sep 15 '20 at 20:41
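
The `m AST` suggestion in the comment above could be sketched roughly as follows. This is only an illustration under assumptions: the toy `Item` type, the line-based `#include "path"` syntax, and the names `parseItems`, `loadWith`, `loadIO`, and `loadMocked` are all hypothetical, and the mocked reader uses an in-memory `Map` rather than the `readMocked` signature from the comment.

```haskell
import Data.Functor.Identity (runIdentity)
import Data.List (stripPrefix)
import qualified Data.Map as Map

-- Hypothetical toy AST: a file is a list of content lines and includes.
data Item = Content String | Include FilePath
  deriving (Eq, Show)

-- Pure parser: a line of the form  #include "path"  becomes an Include;
-- every other line is kept as content.
parseItems :: String -> [Item]
parseItems = map parseLine . lines
  where
    parseLine l = case stripPrefix "#include " l >>= readPath of
      Just path -> Include path
      Nothing -> Content l
    readPath arg = case reads arg of
      [(path, "")] -> Just path
      _ -> Nothing

-- Generic over the monad: the only effect available is the reader passed in.
-- Note: there is no cycle detection here, so a self-including file loops.
loadWith :: Monad m => (FilePath -> m String) -> FilePath -> m [String]
loadWith readSource path = do
  items <- parseItems <$> readSource path
  concat <$> traverse resolve items
  where
    resolve (Content s) = pure [s]
    resolve (Include p) = loadWith readSource p

-- Production version: real file access.
loadIO :: FilePath -> IO [String]
loadIO = loadWith readFile

-- Pure version for tests: an in-memory "file system" in which
-- missing files read as empty.
loadMocked :: Map.Map FilePath String -> FilePath -> [String]
loadMocked files = runIdentity . loadWith (pure . readMocked)
  where
    readMocked p = Map.findWithDefault "" p files
```

Because `loadWith` only knows about `Monad m`, nothing in it can perform IO other than through the reader it is handed, which is the property the comment is after.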

1 Answer


A good solution is to parse into an AST containing dependency information, then resolve the dependencies separately, outside the parser. For example, suppose you have a format that may be an #include line or a content line:

data WithIncludes = WithIncludes [ContentOrInclude]

data ContentOrInclude
  = Content String
  | Include FilePath

And a parser parse :: String -> WithIncludes so that these files:

  • file1:

    before
    #include "file2"
    after
    
  • file2:

    between
    

Parse to these representations:

file1 = WithIncludes
  [ Content "before"
  , Include "file2"
  , Content "after"
  ]

file2 = WithIncludes
  [ Content "between"
  ]

You can add another type representing a flattened file with the includes resolved:

data WithoutIncludes = WithoutIncludes [String]

And separately from parsing, load and recursively flatten includes:

flatten :: WithIncludes -> IO WithoutIncludes
flatten (WithIncludes ls) = WithoutIncludes . concat <$> traverse flatten' ls
  where
    flatten' :: ContentOrInclude -> IO [String]
    flatten' (Content content) = pure [content]
    flatten' (Include path) = do
      contents <- readFile path
      -- Recursively flatten the included file, then unwrap its lines
      -- so the result has the type IO [String] that flatten' promises.
      WithoutIncludes included <- flatten (parse contents)
      pure included

Then running flatten file1 yields:

WithoutIncludes
  [ "before"
  , "between"
  , "after"
  ]

Parsing remains pure, and you just have an IO wrapper around it driving which files to load. You can even reuse the logic here for loading a single file:

load :: FilePath -> IO WithoutIncludes
load path = flatten $ WithIncludes [Include path]

It’s also a good idea to add logic here to check for import cycles, for example by adding an accumulator to flatten containing a Set of canonicalised FilePaths, and checking at each Include that you haven’t seen the same FilePath already.
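
A minimal sketch of the cycle check might look like the following. It makes several assumptions that are not in the answer: plain lists stand in for the `WithIncludes` wrappers, `parseLine` is a stand-in line-based parser, errors are reported as `Either String`, and the names `flattenFrom` and `load` are illustrative. It also varies the idea slightly: the `Set` holds only the files on the current include chain, so a file included twice from different places is accepted and only a true cycle is rejected. It relies on `canonicalizePath` from the `directory` package and `Data.Set` from `containers`.

```haskell
import Data.List (stripPrefix)
import qualified Data.Set as Set
import System.Directory (canonicalizePath)

data ContentOrInclude = Content String | Include FilePath

-- Stand-in for the pure parser: a line of the form  #include "path"
-- becomes an Include, anything else is content.
parseLine :: String -> ContentOrInclude
parseLine l = case stripPrefix "#include " l >>= readPath of
  Just path -> Include path
  Nothing -> Content l
  where
    readPath arg = case reads arg of
      [(path, "")] -> Just path
      _ -> Nothing

-- `active` holds the canonical paths of the files currently being expanded,
-- i.e. the include chain that led to this point.
flattenFrom :: Set.Set FilePath -> [ContentOrInclude] -> IO (Either String [String])
flattenFrom active items = fmap concat . sequence <$> traverse go items
  where
    go (Content c) = pure (Right [c])
    go (Include path) = do
      -- Paths are resolved relative to the current working directory.
      canonical <- canonicalizePath path
      if canonical `Set.member` active
        then pure (Left ("include cycle through " ++ canonical))
        else do
          contents <- readFile canonical
          flattenFrom (Set.insert canonical active) (map parseLine (lines contents))

load :: FilePath -> IO (Either String [String])
load path = flattenFrom Set.empty [Include path]
```

Canonicalising before the membership test matters because the same file can be reached under different spellings, such as `./a` and `a`.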

For a more complex AST, you may want to share most of the structure between the unresolved and resolved types. In that case, you can parameterise the type by whether it’s resolved, and have the unresolved & resolved types be aliases for the underlying AST type with different arguments, for instance:

data File i = File [ContentOrInclude i]

data ContentOrInclude i
  = Content String
  | Include i

type WithIncludes = File FilePath
type WithoutIncludes = File [String]
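
One possible benefit of the parameterised form, sketched here under assumptions, is that resolving includes becomes a `traverse` over the type parameter. The derived instances, the stand-in `parseFile`, and the names `resolve` and `flatLines` are illustrative additions; `resolve` is written against an arbitrary monad so it can be tried without touching the disk, and passing `readFile` gives the `IO` version.

```haskell
{-# LANGUAGE DeriveTraversable #-}

import Data.List (stripPrefix)

data ContentOrInclude i
  = Content String
  | Include i
  deriving (Eq, Show, Functor, Foldable, Traversable)

data File i = File [ContentOrInclude i]
  deriving (Eq, Show, Functor, Foldable, Traversable)

type WithIncludes = File FilePath
type WithoutIncludes = File [String]

-- Stand-in pure parser for a line-based format with  #include "path"  lines.
parseFile :: String -> WithIncludes
parseFile = File . map parseLine . lines
  where
    parseLine l = case stripPrefix "#include " l >>= readPath of
      Just path -> Include path
      Nothing -> Content l
    readPath arg = case reads arg of
      [(path, "")] -> Just path
      _ -> Nothing

-- Replace every include path with the flattened lines of that file.
-- The derived Traversable instance visits exactly the Include payloads.
resolve :: Monad m => (FilePath -> m String) -> WithIncludes -> m WithoutIncludes
resolve readSource = traverse expand
  where
    expand path = do
      source <- readSource path
      nested <- resolve readSource (parseFile source)
      pure (flatLines nested)

-- Collapse a resolved file into its lines, in order.
flatLines :: WithoutIncludes -> [String]
flatLines (File items) = concatMap itemLines items
  where
    itemLines (Content s) = [s]
    itemLines (Include ls) = ls
```

With this shape the `Include` constructors survive resolution and simply carry the included lines, and `flatLines` recovers the flat list when that is what you need.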
Jon Purdy
  • Wouldn't `WithoutIncludes` be `File Void` to match the previous definition? Or are we not doing that? – HTNW Sep 15 '20 at 18:38
  • @HTNW: Both are valid! If you want to retain the structure of includes, and have the result be `File [Content "before", Include ["between"], Content "after"]`, then `[String]` makes sense; if you want to flatten everything so that there are no more `Include` constructors, `Void` makes sense, and the result is `File [Content "before", Content "between", Content "after"]`. The trouble with `Void` is that you still have to *match* the `Include` constructor, and use `absurd` to convince the typechecker that you’ve handled the case. – Jon Purdy Sep 15 '20 at 18:42
  • I think if you want to disable a constructor entirely, then GADTs—or more generally, constraints on constructors—are a more ergonomic approach than filling a type parameter with `Void`. For example, `data CT = C | I; data CI (t :: CT) where { Content :: String -> CI t; Include :: FilePath -> CI I }` and `CI 'C` to disable includes, or `data Phase = Parsed | Flattened; type family HasInclude (p :: Phase) :: Bool where { HasInclude 'Parsed = 'True; HasInclude 'Flattened = 'False }; data CI (p :: Phase) where { Content :: String -> CI p; Include :: (HasInclude p ~ 'True) => FilePath -> CI p }` – Jon Purdy Sep 15 '20 at 18:52
  • How would I add error handling, for example when an included file does not exist? Something like `parse :: String -> Either ErrorType WithIncludes` & `flatten :: WithIncludes -> IO (Either ErrorType WithoutIncludes)`? – sinoTrinity Sep 16 '20 at 06:29
  • @sinoTrinity: You can just use `Either` like that and pattern-match on the result. It might be cleaner to use `ExceptT` from `Control.Monad.Trans.Except`: when you have `IO (Either e a)`, you can wrap it in `ExceptT` to get `ExceptT e IO a`, which lets you keep using normal functions like `traverse` instead of manually matching `Left`/`Right`. To raise an error, use `throwE` or `except`, e.g. `except (parse file)`; to run an `IO` action, use `lift` from `Control.Monad.Trans.Class`, e.g. `lift (doesFileExist path)`. At the end, you can “run” it as an `IO (Either e a)` with `runExceptT`. – Jon Purdy Sep 17 '20 at 20:19
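
The `ExceptT` approach from the last comment could be sketched like this. The `LoadError` type, the stand-in parser, and the use of plain lists instead of the `WithIncludes` wrappers are assumptions made for illustration; `ExceptT`, `except`, `throwE`, `runExceptT`, and `lift` come from the `transformers` package and `doesFileExist` from `directory`.

```haskell
import Control.Monad (unless)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, except, runExceptT, throwE)
import Data.List (stripPrefix)
import System.Directory (doesFileExist)

data ContentOrInclude = Content String | Include FilePath

-- Hypothetical error type covering both parsing and loading failures.
data LoadError
  = MissingFile FilePath
  | BadDirective String
  deriving (Eq, Show)

-- Stand-in pure parser that can fail: an #include line must be followed
-- by exactly one quoted path.
parseLine :: String -> Either LoadError ContentOrInclude
parseLine l = case stripPrefix "#include " l of
  Nothing -> Right (Content l)
  Just arg -> case reads arg of
    [(path, "")] -> Right (Include path)
    _ -> Left (BadDirective l)

parseFile :: String -> Either LoadError [ContentOrInclude]
parseFile = traverse parseLine . lines

-- traverse works unchanged; the first error short-circuits the rest.
flatten :: [ContentOrInclude] -> ExceptT LoadError IO [String]
flatten items = concat <$> traverse go items
  where
    go (Content c) = pure [c]
    go (Include path) = do
      exists <- lift (doesFileExist path)
      unless exists (throwE (MissingFile path))
      contents <- lift (readFile path)
      parsed <- except (parseFile contents) -- lift a pure Either into ExceptT
      flatten parsed

load :: FilePath -> IO (Either LoadError [String])
load path = runExceptT (flatten [Include path])
```

The parser stays pure and returns `Either`, while only `flatten` lives in `ExceptT LoadError IO`, so the split between pure parsing and the IO driver from the answer is preserved.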