
My goal is to be able to detect computer-generated spun content. Here are some examples of spun text:

"As a explicit art fashionable for an advertising organization, you will job to assist put up for auction customers' crop and/or armed forces to their aim marketplace by your original skill and technological ability."

"The actual apple iphone application shop is definitely an abundant cherish residence of useful apps."

Basically, the computer has replaced words with various synonyms in an attempt to make the content unique and bypass plagiarism detection. My goal is to build a system that can detect this gibberish text. What are some ways this could be accomplished?


1 Answer


What you want to do is build an ngram language model. An ngram language model is a statistical representation of how often sequences of n words (word pairs, in the bigram case) occur in a language, and it is used in machine translation, sentiment analysis, and classification tasks such as predicting whether a movie review is positive or negative. Your classification task would be deciding whether each sentence is spun content or not.
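As a minimal sketch of the idea (assuming NLTK is installed; the sentence is taken from your second example), each sentence decomposes into adjacent word pairs, and spun text tends to contain pairs like ("abundant", "cherish") that almost never occur in natural English, so a language model trained on real English assigns them very low probability:

```python
from nltk import bigrams

sentence = "an abundant cherish residence of useful apps"
tokens = sentence.lower().split()  # nltk.word_tokenize works too, given its tokenizer data

# Each bigram is a pair of adjacent words; a language model scores how
# probable each pair is, based on counts from a large training corpus.
print(list(bigrams(tokens)))
```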

A classification model like Naive Bayes (implemented in NLTK) could help with your problem. During training it builds a statistical model of the language, then uses that model for prediction. To train it you will need your spun content examples and a large amount of regular English text; the more you have of both, the better. All documents (you can treat each sentence as a document) should be labeled to indicate whether or not they are spun content.
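Here is a hedged sketch of that training loop with NLTK's NaiveBayesClassifier. The four labeled sentences are toy placeholders loosely based on your examples; in practice you would load your spun samples and a real English corpus with many sentences of each kind:

```python
from nltk.classify import NaiveBayesClassifier

def features(sentence):
    # Bag-of-words featureset: presence of each lowercased token.
    return {token: True for token in sentence.lower().split()}

# Toy placeholder data; replace with your labeled sentences.
train_data = [
    (features("an abundant cherish residence of useful apps"), "spun"),
    (features("you will job to assist put up for auction customers crop"), "spun"),
    (features("a rich treasure trove of useful apps"), "normal"),
    (features("you will work to help sell customers products"), "normal"),
]

classifier = NaiveBayesClassifier.train(train_data)
print(classifier.classify(features("assist put up for auction crop")))
classifier.show_most_informative_features()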

Here is a list of English corpora for your non-spun text.

More complex models may work better, and you can compare them side by side very easily; I like using scikit-learn for that kind of thing.
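For instance, here is one way such a side-by-side comparison might look in scikit-learn, using cross-validated accuracy over a tiny placeholder dataset (swap in your own labeled sentences and a larger cv value once you have real data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholder data; replace with your labeled sentences.
sentences = [
    "you will job to assist put up for auction customers crop",
    "an abundant cherish residence of useful apps",
    "you will work to help sell customers products to their target market",
    "the app store is a rich treasure trove of useful apps",
]
labels = ["spun", "spun", "normal", "normal"]

for model in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    # ngram_range=(1, 2) adds bigram features on top of single words.
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), model)
    scores = cross_val_score(pipeline, sentences, labels, cv=2)
    print(type(model).__name__, scores.mean())
```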

aberger