I have a basic understanding of how TF-IDF works, and using it to detect plagiarism in articles makes sense to me.
Now I have been told to apply it to programming source code. How can this work? In an article, most words come from a natural language such as English, so counting them is meaningful. In source code, each author can invent all kinds of strange variable names, so counting those words doesn't make much sense to me.
Even if I count only function names, my own function names could be just as strange; only system/library function names seem useful for TF.
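To make the concern concrete, here is a minimal sketch of what I have in mind, in plain Python with only the standard library. Everything in it is an assumption for illustration: the three snippets, the `KNOWN` whitelist of keywords and library names, the `ID` placeholder for user-defined names, and the smoothed idf formula are all made up and not taken from any particular plagiarism tool. It compares a snippet with a renamed copy of itself, once counting raw names and once mapping every name outside the whitelist to `ID`.

```python
import math
import re
from collections import Counter

# Hypothetical whitelist: keywords and library names that keep their identity.
KNOWN = {"def", "for", "in", "return", "range", "len", "print", "sum"}

def tokens(code, normalize):
    """Extract identifier-like tokens; optionally map unknown names to 'ID'."""
    words = re.findall(r"[A-Za-z_]\w*", code)
    if normalize:
        words = [w if w in KNOWN else "ID" for w in words]
    return words

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed variant) so common terms keep some weight.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up snippets: 'renamed' is 'original' with every user-defined name changed.
original = "def total(items):\n    return sum(items)\n"
renamed = "def zzq(xs):\n    return sum(xs)\n"
unrelated = "for i in range(len(rows)):\n    print(rows[i])\n"

def similarity(normalize):
    """Similarity of original vs. renamed within the three-snippet corpus."""
    docs = [tokens(s, normalize) for s in (original, renamed, unrelated)]
    vecs = tfidf_vectors(docs)
    return cosine(vecs[0], vecs[1])

if __name__ == "__main__":
    print("raw names:       ", similarity(False))
    print("normalized names:", similarity(True))
```

With raw names, the renamed copy shares only `def`, `return` and `sum` with the original, so the score is well below 1. With names normalized, the two token lists are identical and the score is 1 (up to floating-point rounding). That seems to confirm that raw name counts are fragile, but I don't know whether normalizing like this is what people actually do in practice.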
Can anyone help explain this in more detail? Thanks!