
I have many text documents that I want to compare to one another and remove all text that is exactly the same between them. The goal is to find boilerplate text that is consistent across documents so it can be removed for NLP.

The best way I could figure to do this is to find the Longest Common Substrings that exist, or are mostly present, in all the documents. However, doing this has been incredibly slow.

Here is an example of what I am trying to accomplish:

DocA:

Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960

DocB:

Title: 1984
Author: George Orwell
Published: June 1949

DocC:

Title: The Great Gatsby
Author: F. Scott Fitzgerald

The output would show something like:

{
    'Title': 3,
    'Author': 3,
    'Published': 2,
}

The results would then be used to strip out the commonalities between documents.
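As a rough sketch of that stripping step (the function name, the `min_docs` parameter, and the assumption that `results` maps each common substring to the number of documents containing it are all illustrative):

def strip_common(docs, results, min_docs=2):
    """Remove every substring that was found in at least min_docs documents."""
    # Strip longer substrings first so that shorter ones nested inside them
    # do not leave fragments behind.
    common = sorted((s for s, n in results.items() if n >= min_docs), key=len, reverse=True)
    cleaned = []
    for doc in docs:
        for snippet in common:
            doc = doc.replace(snippet, "")
        cleaned.append(doc)
    return cleaned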

Here is some code I have tested in Python. It's incredibly slow with any significant number of permutations:

import itertools
from difflib import SequenceMatcher

import pandas as pd

# files is a list of the documents' text contents
file_perms = list(itertools.permutations(files, 2))
results = {}
for doc_a, doc_b in file_perms:
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))

        if match.size >= 5:
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]

            if match_word in results:
                results[match_word] += 1
            else:
                results[match_word] = 1

            # remove the matched span from both documents and look for the next longest match
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break

df = pd.DataFrame(
    {
        'Value': list(results.keys()),
        'Count': list(results.values())
    }
)
print(df)
AlanPear
    https://en.wikipedia.org/wiki/Longest_common_subsequence_problem – Christian Sloper Feb 05 '20 at 21:15
    In fact it's more https://en.wikipedia.org/wiki/Longest_common_substring_problem – Victor Deleau Feb 05 '20 at 21:20
  • No wonder that it is very slow. Finding the longest common substring takes `O(n+m+Nk)` with the best algorithm available. Then you are running this over every pair out of D documents, which is `O(D^2)` pairs. Depending on your number of documents, that's a lot of computation – Victor Deleau Feb 05 '20 at 21:24
  • Is there a more optimal way to do this? – AlanPear Feb 05 '20 at 21:28
  • I don't know. But beware that common text might convey different information about the whole text it is in, depending on the context. So maybe you shouldn't remove it – Victor Deleau Feb 05 '20 at 21:32
  • @VictorDeleau this will be used for documents of a given type. So presumably their context would be the same, similar to the examples I posted. – AlanPear Feb 05 '20 at 21:34
  • Maybe you could create a `dict` of word count for each document, then look at the most frequent, use some frequency threshold, and everything that's above it would be deleted from your documents. It would be much faster – Victor Deleau Feb 05 '20 at 21:40
  • Not a bad idea. I imagine there would be a lot of false positives from the free text. Words like 'and' and 'the'. – AlanPear Feb 05 '20 at 21:46
  • That way you will only confirm *again* that `the`, `and`, and `it` (and a well-documented set of others) are still the most common words in English, regardless of your source text. See https://stackoverflow.com/questions/45822827 for example. You might as well skip this step and use a module such as NLTK, which comes with such a list. – Jongware Feb 05 '20 at 22:15
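For reference, a minimal sketch of the word-frequency idea from the comments above, assuming NLTK's English stopword list has been downloaded via nltk.download('stopwords') and that `docs` is a list of document strings (the threshold value is illustrative):

from collections import Counter

from nltk.corpus import stopwords

def frequent_non_stopwords(docs, threshold=0.7):
    """Count, per word, how many documents contain it, ignoring common English stopwords."""
    stop = set(stopwords.words('english'))
    doc_counts = Counter()
    for doc in docs:
        for word in set(doc.split()) - stop:
            doc_counts[word] += 1
    # keep only words present in at least `threshold` of the documents
    return {word: n for word, n in doc_counts.items() if n >= threshold * len(docs)}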

1 Answer


Create a set from each document and build a counter of how many documents each word appears in. Then iterate over every document: when you find a word that appears in 70%-90% of the documents, append it together with the word after it as a tuple to a new counter, and repeat the process for longer phrases.

from collections import Counter

one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")
    word_set = set(word_list)
    for word in word_set:
        one_word[word] += 1

two_word = Counter()
threshold = len(docs) * 0.7

for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list) - 1):
        if one_word[word_list[i]] > threshold:
            key = (word_list[i], word_list[i + 1])
            # the full version below counts each tuple once per document via a set
            two_word[key] += 1

You can play with the threshold and continue extending to longer tuples as long as the counter is not empty.
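A minimal sketch of that iteration written as a single loop over the n-gram length (the names and the stopping condition are assumptions, not the answerer's code; `docs` is a list of document strings):

from collections import Counter

def common_ngrams(docs, threshold=0.7):
    """Count n-grams (tuples of words) appearing in at least `threshold` of the docs, growing n until nothing qualifies."""
    token_docs = [doc.split() for doc in docs]
    min_docs = threshold * len(docs)
    results = {}
    # start with single words; each document contributes a given n-gram at most once
    prev = Counter()
    for tokens in token_docs:
        for gram in {(word,) for word in tokens}:
            prev[gram] += 1
    n = 2
    while prev:
        survivors = {gram for gram, count in prev.items() if count >= min_docs}
        results.update({gram: prev[gram] for gram in survivors})
        current = Counter()
        for tokens in token_docs:
            grams = set()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # only extend n-grams whose (n-1)-word prefix already passed the threshold
                if gram[:-1] in survivors:
                    grams.add(gram)
            for gram in grams:
                current[gram] += 1
        prev = current
        n += 1
    return results

The keys of the result are word tuples, so a two-word phrase comes back as ('and', 'the'), matching the output shown below.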

The docs are the lyrics of the songs "Believer", "By the River of Babylon", "I Could Stay Awake", and "Rattlin Bog":

from collections import Counter
import os
import glob

TR = 1  # threshold
dir = r"D:\docs"
path = os.path.join(dir, "*.txt")
files = glob.glob(path)
one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the key is the file name and the value is the complete doc stored as a list, word by word

common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word] += 1
# common_Frase now counts, for every word, how many files it appears in (each file can add a word once)

two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 1):
        if common_Frase[doc[index]] > TR:
            val = (doc[index], doc[index + 1])
            two_word[key].add(val)

for key in two_word:
    for word in two_word[key]:
        common_Frase[word] += 1
# now common_Frase also counts every two-word phrase

three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 2):
        val2 = (doc[index], doc[index + 1])
        if common_Frase[val2] > TR:
            val3 = (doc[index], doc[index + 1], doc[index + 2])
            three_word[key].add(val3)

for key in three_word:
    for word in three_word[key]:
        common_Frase[word] += 1

for k in common_Frase:
    if common_Frase[k] > 1:
        print(k)

This is the output:

when like all Don't And one the my hear and feeling Then your of I'm in me The you away I never to be what a ever thing there from By down Now words that was ('all', 'the') ('And', 'the') ('the', 'words') ('By', 'the') ('and', 'the') ('in', 'the')

trigonom
  • What is that second loop supposed to be doing? Also, I'm assuming docs.split() is supposed to be doc.split() – AlanPear Feb 05 '20 at 22:53
  • thanks, fixed it. the second loop makes tuples of 2 words and counts how many times they appear in all docs; we count only tuples that begin with a high-frequency word. the next iteration will be to count 3 words – trigonom Feb 05 '20 at 23:04
  • Ah, I'm a little lost, and I think the code's still off. Also, any chance you can post an example input/output? – AlanPear Feb 06 '20 at 15:02