
I have a CSV of sentences and another CSV where the same sentences are broken and jumbled up.

For example, one CSV has:

The quick brown fox jumps over the lazy dog.

And the other CSV has:

jumps over the
The quick brown fox
lazy dog.

Each CSV has more than one sentence, but hopefully you get the idea from the example above.

I've used fuzzy matching to confirm that the snippets match, but now I'd like to reconstruct the sentences.
Is it possible in Python to reorder the jumbled CSV so that it matches the full sentences?

mousetail
  • You could simply check if every part of the sentence appears in the full sentence – mousetail Apr 13 '21 at 07:52
  • You mean you want to reorder the rows in the jumbled CSV so the snippets appear in the correct order? – Elias Strehle Apr 13 '21 at 07:57
  • @EliasStrehle yes that's it exactly! the only problem is there will be more than 1 sentence to match and multiple snippets jumbled in the same csv. – Paula Clark Apr 13 '21 at 08:08
  • `'The quick brown fox jumps over the lazy dog.'.find('jumps over the')` gives you the index position of a substring. Do this for every substring and sort by index. (Might not work as expected if substrings are ambiguous or duplicated in your jumbled CSV). – Elias Strehle Apr 14 '21 at 12:06
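
The find-and-sort idea from the last comment can be sketched with plain string methods, without a fuzzy-matching library. This is only a minimal sketch, assuming every snippet is an exact substring of the sentence it belongs to and does not occur in it more than once; the function name `reorder_snippets` is made up for illustration.

```python
def reorder_snippets(sentence, snippets):
    """Return the snippets found in `sentence`, ordered by their position in it."""
    positioned = []
    for snippet in snippets:
        text = snippet.strip()  # ignore padding that CSV rows may carry
        if not text:
            continue  # skip empty rows
        position = sentence.find(text)  # -1 when the snippet is absent
        if position != -1:
            positioned.append((position, text))
    # Sorting the (position, text) pairs orders the snippets left to right.
    return [text for position, text in sorted(positioned)]


sentence = "The quick brown fox jumps over the lazy dog."
jumbled = ["jumps over the", "The quick brown fox", "lazy dog.", "quickly in fog."]
ordered = reorder_snippets(sentence, jumbled)
print(ordered)
# ['The quick brown fox', 'jumps over the', 'lazy dog.']
```

Snippets that belong to another sentence (here "quickly in fog.") are simply left out, so the same call can be repeated once per full sentence.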

1 Answer


Great and challenging question!

Here is one approach; the comments in the code explain each step:

# Original sentences
clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A wizard's job is to vex chumps quickly in fog."
]

# Rows of the jumbled CSV, represented as a list
jumbled_sentences = [
    "is to vex chumps ",
    "jumps over the ",
    "The quick brown fox ",
    "quickly in fog.",
    "lazy dog.",
    "A wizard's job ",
]

# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # rapidfuzz is faster when many fuzzy-matching operations are needed

for clean_sentence in clean_sentences:

    ordered_sentences = []

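    # Keep only the snippets that are fully contained in the clean sentence:
    # partial_ratio scores 100 when a snippet matches part of the sentence exactly.
    # limit=None lifts the default cap of 5 results, so sentences split into more snippets are not truncated.
    fuzzResults = process.extract(clean_sentence, jumbled_sentences, scorer=fuzz.partial_ratio, score_cutoff=100, limit=None)

    sentences_found = [fuzzResult[0] for fuzzResult in fuzzResults]  # keep only the snippet text from each result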

    index_sent_dict = {}
    for sentence_found in sentences_found:

        # Record where each snippet starts in the clean sentence, as a dictionary of {index: snippet}
        index_sent_dict[clean_sentence.index(sentence_found)] = sentence_found

    # Sort the dictionary by start index and join its values (the snippets) in that order

    sorted_dict = dict(sorted(index_sent_dict.items()))

    final_sentence = "".join(sorted_dict.values())
    print(final_sentence)

# Output:
# The quick brown fox jumps over the lazy dog.
# A wizard's job is to vex chumps quickly in fog.
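
If the same snippet can occur more than once (for example "The" in two different sentences), keying on the start index alone is not enough, because each CSV row should be used exactly once. The following is a rough sketch of one way to handle that, not a drop-in replacement: it walks through each clean sentence from left to right and consumes snippets from a shared pool. The name `reconstruct` is made up, the matching is exact rather than fuzzy, and the greedy longest-match choice has no backtracking, so unusual splits could still defeat it.

```python
def reconstruct(sentence, pool):
    """Rebuild `sentence` from snippets in `pool`.

    Returns (ordered_snippets, leftover_pool), or (None, pool) if it fails.
    """
    remaining = list(pool)
    ordered = []
    position = 0
    while position < len(sentence):
        if sentence[position].isspace():
            position += 1  # whitespace between snippets is not stored in the pool
            continue
        # Snippets that match the sentence exactly at the current position.
        candidates = [s for s in remaining if sentence.startswith(s, position)]
        if not candidates:
            return None, pool  # nothing fits here: give up on this sentence
        best = max(candidates, key=len)  # greedy choice: prefer the longest match
        ordered.append(best)
        remaining.remove(best)  # each CSV row is used only once
        position += len(best)
    return ordered, remaining


clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps.",
]
jumbled = ["lazy dog ", "jumps over the ", "The ", "sleeps.", "The quick brown fox ", "lazy dog."]

pool = [s.strip() for s in jumbled if s.strip()]  # normalise CSV padding
results = []
for sentence in clean_sentences:
    ordered, pool = reconstruct(sentence, pool)
    results.append(ordered)
    print(ordered)
# ['The quick brown fox', 'jumps over the', 'lazy dog.']
# ['The', 'lazy dog', 'sleeps.']
```

Any snippets still in `pool` after the loop did not fit a sentence, which is a useful check that the two CSVs really correspond.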

Shreyesh Desai