I need to do some fairly complex phrase matching. I have a set of text files, each more than 1000 words long.
The phrases I am searching for (searchphrase) are like this:
Investment does not mean: i. Claims to money that arise solely from: 1. Commercial contracts for the sale of goods or services by a national or an enterprise of a party to an enterprise in the territory of the other party, or 2. The extension of credit in connection with a commercial transaction, such as trade financing other than loans or claims to money previously covered.
I want to know whether the phrase occurs in each of my files. However, the files will not contain an exact replica of the phrase. Instead, each file (textfile) will be a large document containing a paragraph like:
But investment does not mean claims to money derived solely from commercial transactions designed exclusively for the sale of goods or services by a national or legal person in the territory of one Contracting Party to a national or legal person in the territory of the other Contracting Party, credits to finance commercial transactions such as trade financing, and other credits with a duration of less than three years, as well as credits granted to the State or to a State enterprise.
As you can see, searchphrase is very close in meaning to this paragraph from textfile, and there is considerable overlap in the keywords. I would therefore expect this to count as a match.
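To make "overlap in the keywords" concrete, here is a minimal sketch of the kind of score I have in mind. It is only an illustration under assumptions: the function names `keywords` and `overlap_score`, the small stopword list, and the idea of comparing against a threshold are all made up for this example, and it captures shared vocabulary only, not similarity of meaning.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {
    "the", "and", "for", "that", "with", "from", "not",
    "does", "such", "than", "other", "but", "well", "one",
}


def keywords(text):
    """Return the set of lowercase alphabetic words longer than two
    characters that are not in the stopword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def overlap_score(searchphrase, paragraph):
    """Fraction of the search phrase's keywords that also occur in the
    paragraph (0.0 means no shared keywords, 1.0 means all are shared)."""
    phrase_words = keywords(searchphrase)
    if not phrase_words:
        return 0.0
    return len(phrase_words & keywords(paragraph)) / len(phrase_words)
```

Calling `overlap_score(searchphrase, paragraph)` for every paragraph of textfile and keeping those above some cutoff would flag the example above, since more than half of the phrase's keywords reappear in it. It would still miss near-variants such as "credit" versus "credits", which is part of why I am asking for something better.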
What sort of algorithm should I use for this? Are there existing libraries or modules that already do this job?