I am working on a web application that tracks helpdesk entries. We want to find a way to prevent people from copying and pasting their notes regarding common issues - we want original helpdesk entries to be written for every trouble-call.

In any case, we have thousands of entries, and some of them are similar. I am trying to find a way of comparing them all to each other and flagging any entries that are very similar to others, e.g. 80% likely to be a direct copy.

I've looked into similar_text() and a few other built-in PHP functions, but I am interested in hearing whether anyone else has done something similar before. I don't believe I can use similar_text() efficiently, since it compares only two strings at a time and I need to compare many entries against each other.

Any input is appreciated.

Andy
    You might find [this](http://stackoverflow.com/questions/1085048/how-would-you-code-an-anti-plagiarism-site) a worthwhile read. – alex May 18 '11 at 04:13

3 Answers

You may want to consider giving Apache Solr a try. While your final schema will likely contain many different fields, the main field would be of the type "text" and would contain the text of the helpdesk entry. The default Solr schema (requiring no modification) automatically tokenizes the data in the text field and indexes it so that searches also match synonyms and word variants: "city" will match "cities", etc.

In the end, using Solr, you will end up with a scalable solution both from a performance standpoint and a functional standpoint.

Jason Palmer

I do think similar_text() would do what you want. As long as your machine has enough memory to handle the comparisons, it should work fine. Also look at levenshtein() and soundex().
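To give a feel for what comparing every entry against every other involves, here is a minimal sketch in Python rather than PHP, purely for illustration. It uses a normalized Levenshtein distance as the similarity score. The function names `levenshtein_distance`, `similarity`, and `find_near_duplicates`, the sample entries, and the 0.8 threshold are all assumptions made for this example, not part of any library.

```python
from itertools import combinations


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only one previous row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def find_near_duplicates(entries, threshold=0.8):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(entries), 2):
        score = similarity(a, b)
        if score >= threshold:
            flagged.append((i, j, score))
    return flagged


# Hypothetical sample data: the first two entries differ by one character.
entries = [
    "Reset the user's password.",
    "Reset the users password.",
    "Replaced the network cable.",
]
near_duplicates = find_near_duplicates(entries)
```

With n entries this checks n(n-1)/2 pairs, so for thousands of long entries it is probably better run as a batch job, or only against recent entries when a new one is saved.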

dtbarne

First off, why do you care? If it's a common issue that can be answered with a copy and paste, why is that not the right thing to do? It sounds like you're generating more work for the sake of work.

Second, if the other options presented here don't suffice, you could look into something like w-shingling: http://en.wikipedia.org/wiki/W-shingling
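As a rough illustration of the shingling idea, the sketch below (in Python, not PHP) breaks each entry into overlapping runs of w words and scores two entries by the Jaccard ratio of their shingle sets. The names `shingles` and `resemblance`, the default w = 3, and the lowercase whitespace tokenization are assumptions chosen for this example.

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of unique w-word shingles in the text."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard ratio of the two shingle sets: |A & B| / |A | B|."""
    set_a, set_b = shingles(a, w), shingles(b, w)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Two five-word sentences that differ only in the last word share
# 2 of their 4 distinct 3-word shingles.
score = resemblance("the quick brown fox jumps", "the quick brown fox sleeps")
# score == 0.5
```

Because each entry reduces to a set, the shingle sets can be computed once per entry and stored, so later comparisons only need set intersections rather than re-reading the full text.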

James
    James - I care because we are on a DoD project that requires us to produce quality reports and feedback. This isn't a traditional "help desk", and in our situation no two entries should be very close. I can't go into all the details, but hopefully that gives you the context. In any case, thanks for the link. – Andy May 18 '11 at 12:27