I'm using the code below to highlight a single matching sequence. (Just copy-paste it into a new Colab notebook and it should run as-is.)

import textwrap

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher

import nltk
nltk.download('punkt')
print('')

text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America.
'''

text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.

Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''

temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')

search_length = len(text1)
total_length = len(text2)

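matching_blocks = temp.get_matching_blocks()
# Each block is Match(a, b, size): a indexes text1's tokens, b indexes text2's tokens.
beginning = 0  # text2's prefix starts at token 0 (matching_blocks[0][0] is an index into text1)
start = matching_blocks[0][1]
stop = matching_blocks[0][1] + matching_blocks[0][2]
end = matching_blocks[1][1]  # start of the next block in text2 (the zero-size sentinel when there is only one match)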

tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])

print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')


Now, when I try to highlight multiple sequences, the code breaks: it doesn't show the full text, and it doesn't highlight the second or any later sequence.

import textwrap

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher

import nltk
nltk.download('punkt')
print('')

text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. North American mainland at least 12,000 years ago, and advanced cultures began to appear later on.
'''

text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.

Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''

temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')

search_length = len(text1)
total_length = len(text2)

matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]

tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])

print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')

I need to highlight at least two sequences. Right now I'm trying to write some sort of if/else statement, which may work. Or is there a better library for this?
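For reference, here is a minimal sketch of the direction I'm considering: looping over every block from get_matching_blocks() rather than only reading matching_blocks[0]. It is not tested against the full texts above. The function name highlight_all and the min_size threshold are hypothetical names of my own, tokens come from a plain str.split() instead of word_tokenize so the snippet runs without NLTK, and autojunk=False is an assumption to keep difflib from treating frequent tokens as junk on long inputs.

```python
from difflib import SequenceMatcher

GREEN = '\x1b[0;30;42m'  # same ANSI highlight code as in the code above
RESET = '\x1b[0m'

def highlight_all(query_tokens, doc_tokens, min_size=3):
    """Join doc_tokens with spaces, wrapping every matching block in ANSI colour.

    highlight_all and min_size are hypothetical names, not part of difflib.
    """
    matcher = SequenceMatcher(None, query_tokens, doc_tokens, autojunk=False)
    pieces = []
    cursor = 0  # number of doc_tokens already emitted
    for block in matcher.get_matching_blocks():
        if block.size < min_size:
            continue  # skips the zero-size sentinel and very short matches
        if block.b > cursor:
            # unhighlighted text between the previous match and this one
            pieces.append(' '.join(doc_tokens[cursor:block.b]))
        matched = doc_tokens[block.b:block.b + block.size]
        pieces.append(GREEN + ' '.join(matched) + RESET)
        cursor = block.b + block.size
    if cursor < len(doc_tokens):
        pieces.append(' '.join(doc_tokens[cursor:]))  # tail after the last match
    return ' '.join(pieces)

query = 'known as the United States mainland at least 12,000 years ago'.split()
doc = ('commonly known as the United States and later the '
       'mainland at least 12,000 years ago it grew').split()
print(highlight_all(query, doc))
```

Swapping the split() calls for word_tokenize and the ' '.join calls for TreebankWordDetokenizer().detokenize should bring this back in line with the code above, though I haven't checked how the detokenizer handles the boundaries between pieces.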

Mystic