How I would approach this (custom enhancements can be added later):
Remove everything that is not a letter, a number, or whitespace;
Use explode() with a space character as the delimiter to get all the words; now you know how many words that article has;
Now find out how many times each word appears in that article, incrementing that word's counter each time it is found in the text;
Store this in an array, like:
$words['wordX']++;
Do the same with the second article that you want to compare against;
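The counting steps so far could look roughly like this. It is only a sketch: the function name countWords() is made up, it lowercases everything, it only keeps ASCII letters and digits, and it turns punctuation into a space (instead of deleting it) so that "end.Next" still splits into two words.

```php
<?php
// Hypothetical helper, not a library function: returns word => count.
function countWords(string $text): array
{
    // Assumption: anything that is not an ASCII letter, digit or whitespace
    // becomes a space, so words glued together by punctuation still split.
    $clean = preg_replace('/[^a-z0-9\s]+/', ' ', strtolower($text));
    // Collapse whitespace runs so explode() does not produce empty strings.
    $clean = trim(preg_replace('/\s+/', ' ', $clean));
    if ($clean === '') {
        return [];
    }

    $words = [];
    foreach (explode(' ', $clean) as $word) {
        // Same idea as $words['wordX']++, but safe for a word seen the first time.
        $words[$word] = ($words[$word] ?? 0) + 1;
    }
    return $words;
}

$original = countWords('John studied feudalism. Feudalism shaped Europe.');
$suspect  = countWords('Feudalism shaped Europe, John said.');

// count() gives the number of distinct words, array_sum() the total word count.
echo count($original), ' distinct words, ', array_sum($original), " in total\n";
```

Run it once per article and keep both arrays around for the comparison.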
Now compare them. You know the original data, so some conclusions can already be made at this step;
Using the capitalized words, like John or Feudalism, you can draw some further conclusions;
From here you may know whether the two articles are about the same thing, and this could be the real step #1.
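One possible way to do the comparison is below. This is my own pick, not the only option: cosine similarity over the two word => count arrays, plus a very naive regex for capitalized words. Both function names are made up, and the regex also catches ordinary words at the start of a sentence, so treat its result as a hint only.

```php
<?php
// Hypothetical helper: cosine similarity of two word => count arrays.
// 1.0 means the same word distribution, 0.0 means no word in common.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0;
    foreach ($a as $word => $count) {
        $dot += $count * ($b[$word] ?? 0);
    }
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $a)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty article is similar to nothing
    }
    return $dot / ($normA * $normB);
}

// Hypothetical helper: distinct words that start with an ASCII capital letter.
// Must be run on the raw text, before lowercasing or stripping anything.
function capitalizedWords(string $text): array
{
    preg_match_all('/\b[A-Z][a-z]+\b/', $text, $matches);
    return array_values(array_unique($matches[0]));
}

$score = cosineSimilarity(['john' => 1, 'feudalism' => 2], ['feudalism' => 1, 'pizza' => 3]);
$names = array_intersect(
    capitalizedWords('John studied Feudalism.'),
    capitalizedWords('Feudalism, said John.')
);
```

A high score together with many shared capitalized words suggests the two articles are about the same subject; what counts as "high" is something you would have to tune on real data.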
Now you somehow have to parse both articles, word by word, at the same time, and see the differences between them.
A student can add their own "original" sentence after each sentence or paragraph taken from the original article.
Make sure that if you advance too far in one of the articles, you keep the parsing balanced: parse the second article until it catches up.
I see 2 for loops, maybe 3, or instead of the third, a function that tries to keep the parsing process balanced.
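A rough sketch of that balanced walk, under my own assumptions: both articles are already split into arrays of words, lookAhead() and countAlignedWords() are made-up names, and the window of 10 words is an arbitrary limit on how far one side may run ahead before giving up on catching up.

```php
<?php
// Hypothetical helper: offset (0..$window) at which $needle first appears
// in $words when searching from index $from, or null if it is not found.
function lookAhead(array $words, int $from, string $needle, int $window): ?int
{
    $limit = min(count($words), $from + $window + 1);
    for ($k = $from; $k < $limit; $k++) {
        if ($words[$k] === $needle) {
            return $k - $from;
        }
    }
    return null;
}

// Walks both word lists at the same time and counts the words that line up.
function countAlignedWords(array $a, array $b, int $window = 10): int
{
    $i = 0;
    $j = 0;
    $matches = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $matches++;
            $i++;
            $j++;
            continue;
        }
        // Mismatch: check how far each side must skip to catch up with the other.
        $skipB = lookAhead($b, $j + 1, $a[$i], $window); // extra words in $b
        $skipA = lookAhead($a, $i + 1, $b[$j], $window); // extra words in $a
        if ($skipB !== null && ($skipA === null || $skipB <= $skipA)) {
            $j += $skipB + 1; // $b had inserted words, jump past them
        } elseif ($skipA !== null) {
            $i += $skipA + 1; // $a had inserted words, jump past them
        } else {
            $i++; // no way to resync nearby: treat it as a replaced word
            $j++;
        }
    }
    return $matches;
}

$original = explode(' ', 'the cat sat on the mat');
$suspect  = explode(' ', 'the cat really sat on my the mat');
$aligned  = countAlignedWords($original, $suspect);
```

Dividing the result by the number of words in the original gives an idea of how much of it survived in the second article. The loop is greedy, so it can be fooled by common words such as "the"; a proper diff algorithm would do better.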
Also, you have to use explode() and check sentence by sentence, and word by word within each sentence, to find the similarity.
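For the sentence-level check, something like the following might do. Assumptions on my side: a full stop is the only sentence delimiter (abbreviations, "?" and "!" are ignored), words inside a sentence are separated by single spaces with no other punctuation, and the score is the share of distinct words two sentences have in common (the Jaccard index). All three function names are made up.

```php
<?php
// Hypothetical helper: split a text into trimmed, non-empty sentences.
function splitSentences(string $text): array
{
    $parts = array_map('trim', explode('.', $text));
    return array_values(array_filter($parts, fn($s) => $s !== ''));
}

// Hypothetical helper: shared distinct words / all distinct words (0.0 to 1.0).
function sentenceOverlap(string $a, string $b): float
{
    $wordsA = array_unique(explode(' ', strtolower($a)));
    $wordsB = array_unique(explode(' ', strtolower($b)));
    $shared = count(array_intersect($wordsA, $wordsB));
    $total  = count(array_unique(array_merge($wordsA, $wordsB)));
    return $shared / $total; // explode() never returns an empty array, so $total >= 1
}

// For every sentence of the suspect text, the best score against any original sentence.
function bestMatches(string $original, string $suspect): array
{
    $sources = splitSentences($original);
    $result = [];
    foreach (splitSentences($suspect) as $candidate) {
        $best = 0.0;
        foreach ($sources as $source) {
            $best = max($best, sentenceOverlap($candidate, $source));
        }
        $result[$candidate] = $best;
    }
    return $result;
}

$report = bestMatches(
    'Feudalism shaped Europe. John was a king.',
    'I like pizza. Feudalism shaped Europe.'
);
```

Sentences with a score near 1 were most likely copied, and sentences near 0 are probably the student's own additions.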
I am sure you get the idea, but I will say it again: you can't parse the entire WWW.