I'm trying to write a C module to lexically analyse Python code. How can I do it?
- Cut the net speak, and can you state your problem in a much more specific way? – Xavier Ho May 15 '10 at 15:02
- (To others confused by Xavier's comment, it is against the first revision of the question. The more recent edit is much clearer.) – Oddthinking May 15 '10 at 16:58
1 Answer
The complete, detailed specification for doing lexical analysis of Python code is the "Lexical analysis" chapter of the Python language reference.
As you can see, there are a lot of cases you need to cover. One thing that helps: you can always check whether your C-implemented lexical analyzer is correct for a given Python fragment, because it has to return exactly what the tokenize module in Python's standard library (itself implemented in Python) returns.
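To make that check concrete, here is a minimal sketch of a reference harness, assuming Python 3 (where `tokenize.generate_tokens` accepts a `readline` callable that returns text lines). The helper names `reference_tokens` and `first_mismatch` are made up for illustration, and how you get the token list out of your C lexer is left open:

```python
import io
import tokenize


def reference_tokens(source):
    """Return (token name, token text) pairs from the stdlib tokenize module."""
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]


def first_mismatch(expected, actual):
    """Return the index of the first differing token, or None if both lists match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One list is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


if __name__ == "__main__":
    # Dump the reference tokens for a small fragment; a C lexer under test
    # (hypothetical here) would have to produce the same sequence.
    for name, text in reference_tokens("if x:\n    y = 1\n"):
        print(name, repr(text))
```

Feeding the same fragment to your C lexer and passing both token lists to `first_mismatch` tells you where the two first disagree. Even a fragment this small already yields INDENT and DEDENT tokens, which are among the cases the C code has to reproduce.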
As you can see in tokenize's sources, it's several hundred lines of Python, so you can easily extrapolate to needing thousands of lines of C -- definitely not a weekend project;-)
Of course, as a starting point, you can fork Python's own Parser/tokenizer.c -- that's less than 2000 lines (amazingly short for what it does!), but in good part because it's relying on quite a few other bits and pieces from Python's runtime (if your implementation needs to be stand-alone you'll therefore need to reproduce those).
If you're a very experienced programmer with a strong understanding of Python's codebase, and can just sprint on this for all your waking hours, you might make it in a week or so. Under normal circumstances, I'd say expecting a month of work would be a bit optimistic. What's your deadline?

- I'd also ask why you want to do this in C rather than in Python. – Noufal Ibrahim May 15 '10 at 17:25