Parser generation

Question

I am doing a project on software plagiarism detection. I intend to do it with the C language. For that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?

I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized. For that I need to create a syntax analyzer, but I don't know where to start.

I.e., I want to create a parser for C programs in Python.

Maybe the OP means that he wants to do plagiarism detection against programs written in C using Python as the language to write his detector in, or vice versa. More information is necessary. — Dan Joseph, Oct 20 '10 at 12:49
+1 for a reasonable question, to help offset all the dings. Seemed pretty clear what he was asking. Then again, I've built clone detectors so I'm probably sensitive to the phrasing. — Ira Baxter, Oct 20 '10 at 19:31

score 3 · Answer 1 · answered Oct 20 '10 at 12:05

If you want to create a parser in Python, you can look at these libraries:
- PLY
- pyparsing
- Lepl (new, but very powerful)

rubik

These are a good idea only if the OP defines a very simple model of C, which might be OK for an academic project. – Ira Baxter Oct 20 '10 at 19:49

Ira Baxter · Answer 2 · 2010-10-20T19:41:41.030

1

Building a real C parser by yourself is a really big task.

I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.

You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
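To give a feel for what "a really simple subset of C" could look like, here is a minimal sketch of a hand-written recursive-descent parser in Python. It only handles arithmetic expressions over integers, identifiers, `+ - * /` and parentheses; the grammar, the tuple-based tree shape and all names (`tokenize`, `Parser`, `parse`) are assumptions made up for this illustration, not anything taken from pycparser or a real C grammar.

```python
import re

# Tokens of the toy subset: integers, identifiers, four operators, parentheses.
TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|[-+*/()]")


def tokenize(src):
    # Anything the pattern does not recognise is silently skipped in this sketch.
    return TOKEN_RE.findall(src)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.next()
            node = (op, node, self.term())
        return node

    def term(self):
        # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.next()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor := NUMBER | IDENT | '(' expr ')'
        tok = self.next()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            node = self.expr()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in ("+", "-", "*", "/", ")"):
            raise SyntaxError("unexpected token " + repr(tok))
        return ("num", int(tok)) if tok.isdigit() else ("id", tok)


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input after expression")
    return tree


if __name__ == "__main__":
    print(parse("a + b * (c - 1)"))
```

Because whitespace never reaches the tree, `parse("a+b")` and `parse("a  +\n b")` give the same result, which is the property a detector cares about. Statements, declarations, types and the preprocessor are where the real work of parsing C starts, and none of that is covered here.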

edited Oct 20 '10 at 19:41

answered Oct 20 '10 at 19:29

Ira Baxter

Having built both parsers and clone detectors, I think they're about equally hard. C at least has a documented definition as a reference (sort of; real compilers vary from it more than you'd expect). For clone detection you need to decide what heuristics you're going to use and then do what you can to make them as effective as possible. For one implementation usable on C code, see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.1272&rep=rep1&type=pdf – Ira Baxter Oct 20 '10 at 19:45

score 0 · Answer 3 · answered Oct 20 '10 at 19:55

I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.

What you're really looking for is sequences of original source code that have a very strong similarity with a suspect sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
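As a rough illustration of "sequences with strong similarity", here is a small Python sketch. Note that it is not a Bayes classifier: it swaps in a simpler measure, the Jaccard overlap of token k-grams, just to show the idea of scoring shared sequences. The function names and the choice of `k=3` are assumptions for this example.

```python
import re


def tokens(src):
    # Crude split into words and single punctuation characters;
    # whitespace and line breaks are discarded.
    return re.findall(r"\w+|[^\w\s]", src)


def kgrams(seq, k=3):
    # Set of all runs of k consecutive tokens.
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    # Shared k-grams divided by all distinct k-grams in either input.
    ga, gb = kgrams(tokens(a), k), kgrams(tokens(b), k)
    if not ga or not gb:
        return 0.0  # too little input to say anything
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    original = "int sum = a + b ;"
    reformatted = "int   sum=a+b;"
    unrelated = "while (x) { x--; }"
    print(jaccard(original, reformatted))
    print(jaccard(original, unrelated))
```

A score near 1 means most token runs are shared and a score near 0 means almost none are. What threshold counts as suspicious is a tuning decision that this sketch does not try to answer.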

SingleNegationElimination

Depends on the purpose of his detector. If you want good answers for plagiarism on C source code, you need to do this in a way that is independent of formatting. Comparing "lines of text" won't do this, so you need something that isn't lines. Tokens are a useful granularity for doing this. Better still are abstract syntax trees, which is what the OP appears to be fishing for; see my answer for a reference to a technical paper that does exactly this. – Ira Baxter Oct 20 '10 at 23:05
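A minimal sketch of the token-level idea in Python, using only the standard library (`re` and `difflib`). It drops comments and whitespace so formatting does not matter. As an extra assumption that goes beyond the comment above, it also replaces identifiers, numbers and strings with placeholders so that simple renaming does not hide a copy. The keyword list is deliberately incomplete, and the regular expression is nowhere near a full C lexer (no preprocessor, no character literals, no multi-character operators).

```python
import re
from difflib import SequenceMatcher

# Small, deliberately incomplete keyword list (assumption for this sketch).
KEYWORDS = {"int", "char", "float", "double", "void",
            "if", "else", "for", "while", "return"}

TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)    # block and line comments
  | (?P<string>"(?:\\.|[^"\\])*")      # string literals
  | (?P<number>\d+(?:\.\d+)?)          # simple numeric literals
  | (?P<ident>[A-Za-z_]\w*)            # identifiers and keywords
  | (?P<op>[^\s\w])                    # any other single character
""", re.VERBOSE | re.DOTALL)


def normalized_tokens(src):
    out = []
    for m in TOKEN_RE.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind == "comment":
            continue                    # comments never reach the token stream
        if kind == "ident" and text not in KEYWORDS:
            out.append("ID")            # hide the actual variable name
        elif kind == "number":
            out.append("NUM")
        elif kind == "string":
            out.append("STR")
        else:
            out.append(text)            # keywords and punctuation kept as-is
    return out


def similarity(a, b):
    # Ratio in [0, 1] of matching tokens between the two normalized streams.
    return SequenceMatcher(None, normalized_tokens(a), normalized_tokens(b)).ratio()


if __name__ == "__main__":
    a = "int total = 0; /* sum */\nfor (i = 0; i < n; i++) total += v[i];"
    b = "int s=0;\nfor(k=0;k<len;k++)\n  s+=arr[k]; // renamed"
    print(similarity(a, b))
```

The two snippets in the usage example differ in layout, comments and variable names, yet produce the same normalized token stream. Comparing abstract syntax trees, as suggested above, goes further than this but needs a real parser first.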
