
I have a basic understanding of how TF-IDF works, and using it to detect plagiarism in articles makes sense to me.

Now I have been told to use it on programming source code. How can this work? In an article, most words are natural-language words (say, English), so you can count them. In source code, each person can define all kinds of strange variable names, so counting words doesn't make much sense to me.

Even if I only count function names, my own function names could be just as strange; only system/library function names seem useful for TF.

Can anyone help explain this further? Thanks!

user3552178
  • Imagine that you had a corpus of every single source code file that you had written in a specific language. You then tokenize the corpus into n "words" (e.g., variable names, function names, operators, etc.) and get the counts. These words will have a specific frequency distribution across your corpus; basically, it acts as an n-length vector that is a "fingerprint" of you, the writer of the code. Given a new code file written by an unknown author, you just tokenize that file and compare the distributions. – Brandon Loudermilk Feb 24 '19 at 14:52
  • @BrandonLoudermilk In theory you are right, provided there are many of your source code files from which to build the fingerprint, but it's not efficient. Variable and function names still differ between two files, e.g. doMathHomework() and doEnglishHomework(). A human reading them can tell they are in the same style, but to a simple TF-IDF (without extra processing) they look like two different words. – user3552178 Feb 24 '19 at 21:41
  • Yes, there is intra-author variation in convention/style, but that doesn't change the fact that there is inter-author variation as well. It's these relative differences in distribution that your classifier will ultimately have to capture to perform this task. On any given token, a target text may or may not be similar to the distribution in the corpus, but on average a target text written by the same author as the corpus will have more similarly distributed tokens than one written by a different author. – Brandon Loudermilk Feb 25 '19 at 17:54
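
To make the idea in the comments concrete, here is a minimal sketch of "tokenize the code, weight the tokens with TF-IDF, compare the distributions". Everything in it is an assumption made for illustration, not something prescribed in this thread: the regex tokenizer, the smoothed IDF formula, the deliberately tiny `KEYWORDS` set, the `normalize` option (one possible form of the "extra processing" mentioned above, which replaces every non-keyword identifier with a placeholder `ID`), and the three toy snippets.

```python
import math
import re
from collections import Counter

# Identifiers/keywords, integer literals, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")

# Hypothetical, deliberately tiny keyword list; a real one depends on the language.
KEYWORDS = {"for", "while", "if", "else", "return", "int"}


def tokenize(source, normalize=False):
    """Split source code into tokens; optionally map user identifiers to 'ID'."""
    tokens = TOKEN_RE.findall(source)
    if normalize:
        tokens = [
            "ID" if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS else tok
            for tok in tokens
        ]
    return tokens


def tfidf_vectors(docs, normalize=False):
    """Return one {token: weight} dict per document (TF times smoothed IDF)."""
    tokenized = [tokenize(doc, normalize) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            tok: (cnt / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarities(known, candidates, normalize=False):
    """Similarity of each candidate file to the known file."""
    vectors = tfidf_vectors([known] + candidates, normalize)
    return [cosine(vectors[0], vec) for vec in vectors[1:]]


# Toy snippets (made up): SAME is KNOWN with renamed variables, OTHER is a
# differently structured loop that happens to reuse some of KNOWN's names.
KNOWN = "for (int i = 0; i < n; ++i) { total += values[i]; }"
SAME = "for (int j = 0; j < m; ++j) { sum += items[j]; }"
OTHER = "x = 0\nwhile x < n:\n    total = total + values[x]\n    x = x + 1\n"

if __name__ == "__main__":
    print("raw tokens:       ", similarities(KNOWN, [SAME, OTHER]))
    print("normalized tokens:", similarities(KNOWN, [SAME, OTHER], normalize=True))
```

On these toy inputs the renamed copy scores higher than the unrelated loop even with raw tokens, because keywords, operators, and punctuation still line up. With `normalize=True` the renamed copy becomes token-for-token identical to the known file. The trade-off is that normalization throws away exactly the naming habits that the "fingerprint" argument relies on, so which variant fits depends on whether the goal is spotting renamed copies or attributing authorship.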

0 Answers