
I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).

I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.

I figured that what I should really do is use Lisp itself, since it already has a Lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.

Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error

There is no package with the name WOO.EV.TCP

which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.

Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.

I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.

It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.

Rainer Joswig
user3113723
  • I'm pretty sure you can't (in general) parse Lisp without executing parts of it because you can run arbitrary code at parse time, which can then modify the reader so all following code is parsed differently. (See also: Parsing Perl is impossible without implementing a full Perl interpreter, which is the same problem.) – melpomene May 18 '19 at 21:09
  • I'm aware of this, but it doesn't have to be a fully-general parser. That's why I explained the use-case. – user3113723 May 18 '19 at 22:36
  • Here are two quite similar questions: https://stackoverflow.com/questions/52405865/reading-qualified-symbols https://stackoverflow.com/questions/53127961/parse-string-with-read-and-ignore-package-namespaces I'm hesitant to mark them as duplicates because the intent seems to be different. – Svante May 19 '19 at 10:05
  • I bet there are special purpose reader implementations which can do that. The plain CL reader does not support it. In CL a package needs to be defined before a symbol with a package prefix can be read. – Rainer Joswig May 19 '19 at 16:25
  • @Svante that's exactly what I was looking for w/r/t package symbols, although I have other problems. I guess it's kind of a duplicate. – user3113723 May 19 '19 at 23:07

2 Answers

2

I am almost sure there is no easy portable way to do this for a number of reasons.

(Just limiting things to the non-existent-package problem for now.)

First of all, there is no portable access to the bit of the reader which decides that a token is going to be a symbol and then looks for package markers etc.: that just happens according to the rules in section 2.3 of the spec (Interpretation of Tokens). So you can't easily intervene in this.

Secondly, the conditions the reader might signal do not portably carry enough information for you to handle them.

There are several possible ways out of this bit of the problem.

If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.

If you felt less heroic you could write a more primitive token-reader which doesn't even try to deal with anything except grabbing all the characters needed, and which returns some kind of object that wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
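To make the "primitive token-reader" idea concrete, here is a rough sketch of it. It is written in Python rather than Lisp, only because the asker is already post-processing in Python. The names `Token` and `read_forms` are made up for illustration. It assumes standard syntax only: it knows about parentheses, strings, `;` comments and backslash escapes, and it does not handle `|...|` symbols or `#|...|#` block comments. Quote characters and dispatch prefixes such as `#+` are simply left glued to, or emitted as, ordinary tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """An uninterpreted token: the source text is kept verbatim."""
    text: str


def read_forms(src):
    """Read all top-level forms of src into nested lists of Token objects."""
    pos = 0
    n = len(src)

    def skip_blank():
        # Skip whitespace and ';' comments.
        nonlocal pos
        while pos < n:
            if src[pos].isspace():
                pos += 1
            elif src[pos] == ';':
                while pos < n and src[pos] != '\n':
                    pos += 1
            else:
                break

    def read_form():
        nonlocal pos
        ch = src[pos]
        if ch == '(':
            pos += 1
            items = []
            while True:
                skip_blank()
                if pos >= n:
                    raise SyntaxError("unbalanced '('")
                if src[pos] == ')':
                    pos += 1
                    return items
                items.append(read_form())
        if ch == ')':
            raise SyntaxError("unexpected ')'")
        start = pos
        if ch == '"':
            # String literal: kept as one token, quotes included.
            pos += 1
            while pos < n and src[pos] != '"':
                pos += 2 if src[pos] == '\\' else 1
            if pos >= n:
                raise SyntaxError("unterminated string")
            pos += 1
            return Token(src[start:pos])
        # Anything else: grab characters up to the next delimiter.
        # A backslash escapes the following character, so #\( stays whole.
        while pos < n and not src[pos].isspace() and src[pos] not in '()";':
            pos += 2 if src[pos] == '\\' else 1
        return Token(src[start:pos])

    forms = []
    while True:
        skip_blank()
        if pos >= n:
            return forms
        forms.append(read_form())
```

Because nothing is interned, `woo.ev.tcp:start-listening-socket` comes back as `Token('woo.ev.tcp:start-listening-socket')` whether or not the package exists, and `1.5` comes back as `Token('1.5')` rather than a number. That is exactly the information loss mentioned above.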

If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.


But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.

0

I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
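For illustration only, a substitution of this kind could look like the sketch below. The pattern, the function name `hide_package_markers` and the `===` separator (borrowed from the question) are assumptions, not the code actually used. It does not skip strings or comments, so text such as `12:30` inside a docstring would be rewritten too.

```python
import re

# Hypothetical pattern: a package prefix at the start of a token,
# followed by one or two colons and then a symbol name.
PACKAGE_MARKER = re.compile(
    r"""
    (?<![^\s()'`,@])         # prefix starts at a token boundary or at start of text
    ([^\s()'`,@":;#|\\]+)    # package name: a run of ordinary constituent characters
    ::?                      # external or internal package marker
    (?=[^\s()":;])           # something that looks like a symbol name must follow
    """,
    re.VERBOSE,
)


def hide_package_markers(source, sep="==="):
    """Replace pkg:sym and pkg::sym with pkg===sym so the reader sees one plain symbol."""
    return PACKAGE_MARKER.sub(lambda m: m.group(1) + sep, source)
```

Keywords (`:key`) and uninterned symbols (`#:bar`) are left alone because nothing precedes the colon that could be a package name.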

For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
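As a sketch of what "just make a list" could mean once the source has already been split into tokens: the hypothetical helper below takes nested Python lists of token strings and folds each `#+`/`#-` prefix together with its feature expression and the form it guards. It assumes some tokenizer has already produced that structure, which is the hard part, and it raises `IndexError` if a conditional has nothing after it.

```python
def fold_conditionals(items):
    """Turn '#+feat form' / '#- (expr) form' sequences into ['#+', feat, form] lists."""
    out = []
    pos = 0

    def take():
        # Consume one logical form starting at items[pos].
        nonlocal pos
        item = items[pos]
        pos += 1
        if isinstance(item, list):
            return fold_conditionals(item)
        if item[:2] in ('#+', '#-'):
            # '#+sbcl' carries its feature; a bare '#+' takes the next form as the feature.
            feature = item[2:] or take()
            return [item[:2], feature, take()]
        return item

    while pos < len(items):
        out.append(take())
    return out
```

For example, `fold_conditionals(['#+sbcl', ['foo'], 'x'])` gives `[['#+', 'sbcl', ['foo']], 'x']`, which is close to the `(#+ 'sbcl <something>)` shape asked for in the question.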

user3113723
  • Given that things inside `#+` are probably not readable at all by implementations which ignore them, because packages are missing, to 'make a list' you need to solve the tokenizing problem. There is no easy solution to this problem if you want something that will actually work properly. If you don't care about that (which is fine), then what you're doing is probably fine. –  May 20 '19 at 08:05