-2

I am doing a project on software plagiarism detection, and I intend to do it for the C language. For that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?

I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized. For that I need to create a syntax analyzer, and I don't know where to start.

That is, I want to create a parser for C programs in Python.

Aneeshia
  • 45
  • 1
  • 2
  • 7
  • Indeed: What is this I don’t even –  Oct 20 '10 at 10:06
  • 21
    I'm sure there's some code out there you can copy. – Glenn Maynard Oct 20 '10 at 10:20
  • Maybe the OP means that he wants to do plagiarism detection against programs written in C using Python as the language to write his detector in, or vice versa. More information is necessary. – Dan Joseph Oct 20 '10 at 12:49
  • +1 for a reasonable question, to help offset all the dings. Seemed pretty clear what he was asking. Then again, I've built clone detectors so I'm probably sensitive to the phrasing. – Ira Baxter Oct 20 '10 at 19:31

3 Answers

3

If you want to create a parser in Python, you can look at these libraries:
PLY
pyparsing
Lepl (new, but very powerful)

rubik
  • 8,814
  • 9
  • 58
  • 88
  • These are a good idea only if the OP defines a very simple model of C, which for an academic project might be OK. – Ira Baxter Oct 20 '10 at 19:49
1

Building a real C parser by yourself is a really big task.

I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.

You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
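To give a feel for what "a really simple subset of C" could look like, here is a minimal sketch of a hand-written recursive-descent parser. The subset is an assumption made up for illustration: only assignment statements of the form `name = expression;` with integers, identifiers, `+ - * /` and parentheses. The names `tokenize`, `Parser` and `parse` are hypothetical, and this is nowhere near real C.

```python
import re

# Hypothetical token pattern for the toy subset: integers, identifiers,
# and any other single non-space character as an operator/punctuation token.
TOKEN_RE = re.compile(r"(?P<num>\d+)|(?P<id>[A-Za-z_]\w*)|(?P<op>\S)")


def tokenize(src):
    """Return a list of (kind, text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(src)]


class Parser:
    """Recursive-descent parser producing nested tuples as a tiny AST."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", "")

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, text):
        _, value = self.advance()
        if value != text:
            raise SyntaxError(f"expected {text!r}, got {value!r}")

    def program(self):
        # program := statement*
        stmts = []
        while self.peek()[0] != "eof":
            stmts.append(self.statement())
        return stmts

    def statement(self):
        # statement := id '=' expr ';'
        kind, name = self.advance()
        if kind != "id":
            raise SyntaxError(f"expected identifier, got {name!r}")
        self.expect("=")
        value = self.expr()
        self.expect(";")
        return ("assign", name, value)

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek()[1] in ("+", "-"):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek()[1] in ("*", "/"):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor := num | id | '(' expr ')'
        kind, value = self.advance()
        if kind == "num":
            return ("num", int(value))
        if kind == "id":
            return ("id", value)
        if value == "(":
            node = self.expr()
            self.expect(")")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


def parse(src):
    return Parser(tokenize(src)).program()
```

Because the result is a tree rather than text, `parse("x = 1 + 2 * y;")` and `parse("x=1+2*y ;")` yield the same structure, which is the property a detector would compare on. Declarations, control flow, pointers and the preprocessor are all missing here, and those are where the real effort goes.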

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • 1
    Having built both parsers and clone detectors, I think they're about equally hard. C at least has a documented definition as a reference (sort of; the real compilers vary from it more than you'd expect). For clone detection you need to decide what heuristics you're going to use and then do what you can to make them as effective as possible. As one implementation usable on C code, see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.1272&rep=rep1&type=pdf – Ira Baxter Oct 20 '10 at 19:45
0

I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.

What you're really looking for is sequences of original source code that have a very strong similarity to the suspect sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.

SingleNegationElimination
  • 151,563
  • 33
  • 264
  • 304
  • Depends on the purpose of his detector. If you want good answers for plagiarism on C source code, you need to do this in a way that is independent of formatting. Comparing "lines of text" won't do this, so you need something that isn't lines. Tokens are a useful granularity for doing this. Better are abstract syntax trees, which is what the OP appears to be fishing for; see my answer for a reference to a technical paper that does exactly this. – Ira Baxter Oct 20 '10 at 23:05
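
The formatting-independent, token-level comparison described in the comment above could be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the linked paper: the keyword list is partial, every identifier is collapsed to `ID` and every integer to `NUM`, and similarity is the Jaccard overlap of token k-grams. The names `normalized_tokens`, `kgrams` and `similarity` are hypothetical.

```python
import re

# Partial keyword list, assumed for illustration only.
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "if", "else", "for",
    "while", "do", "return", "struct", "switch", "case", "break", "continue",
}

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Naive comment stripper; it does not handle comment markers inside strings.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def normalized_tokens(src):
    """Tokenize C-like text, hiding identifier names and literal values."""
    src = COMMENT_RE.sub(" ", src)
    out = []
    for tok in TOKEN_RE.findall(src):
        if tok[0].isalpha() or tok[0] == "_":
            out.append(tok if tok in C_KEYWORDS else "ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def kgrams(tokens, k=4):
    """Set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(src_a, src_b, k=4):
    """Jaccard similarity of the two programs' token k-gram sets."""
    a = kgrams(normalized_tokens(src_a), k)
    b = kgrams(normalized_tokens(src_b), k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x,\n int y)\n{\n  /* add */ return x + y;\n}"
print(similarity(original, renamed))
```

Renaming variables, reflowing lines and adding comments leave the normalized token stream unchanged, so the two samples above score 1.0. Reordered statements or restructured control flow would still lower the score, which is where a tree-based comparison does better.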