I have created an array of 4 strings, each containing one plot summary from https://www.imdb.com/title/tt0061852/plotsummary/. I want to write a program that returns the paragraph most relevant to a query.

I am using text-embedding-ada-002 to create an embedding of each paragraph, and then computing the cosine similarity between the embedding of a query and the embeddings of the 4 strings. My expectation is that the most relevant paragraph will have the highest cosine similarity.
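For reference, the comparison I have in mind is the standard cosine similarity. Below is a minimal sketch on toy vectors; the name `cosine_sim` and the vector values are illustrative assumptions, not the helper or the data used later in this question.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query, different length
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_sim(query_vec, close_vec))  # close to 1: same direction
print(cosine_sim(query_vec, far_vec))    # 0: no shared direction
```

The score depends only on direction, not on vector length, which is why a long paragraph is not favoured merely for being long.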

The deployment I am using:

{
  "scale_settings": {
    "scale_type": "standard"
  },
  "model": "text-embedding-ada-002",
  "owner": "organization-owner",
  "id": "textembeddingadamc002",
  "status": "succeeded",
  "created_at": 1682061799,
  "updated_at": 1682061799,
  "object": "deployment"
}

I create an array of the summaries and load it into a DataFrame:

junglebooksummaries = [
    "Abandoned after an accident, baby Mowgli is taken and raised by a family of wolves. As the boy grows older, the wise panther Bagheera realizes he must be returned to his own kind in the nearby man-village. Baloo the bear, however, thinks differently, taking the young Mowgli under his wing and teaching him that living in the jungle is the best life there is. Bagheera realizes that Mowgli is in danger, particularly from Shere Khan the tiger who hates all people. When Baloo finally comes around, Mowgli runs off into the jungle where he survives a second encounter with Kaa the snake and finally, with Shere Khan. It's the sight of a pretty girl, however, that draws Mowgli to the nearby man-village and stay there.",
    "Based on Rudyard Kipling's book, Disney's \"The Jungle Book\", tells the story of how a young boy named Mowgli was raised by a pack of wolves, the young man-cub has adapted into the jungle and its surroundings quite well. Bagheera, a wise panther, must take Mowgli to the man-village to ensure his protection from the treacherous tiger, Shere Khan who wants to kill the boy in an effort to maintain balance within the jungle. Mowgli isn't ready to leave his home just yet and a fun-loving sloth bear named Baloo takes the man cub under his wing, but it proves to be an adventure for him and his best friend. Mowgli encounters King Louie, who wants to know how to make fire from the man-cub and he has a run in with Kaa, a hungry python who wants to eat the young man-cub. Mowgli must decide wether he belongs in the jungle or the man-village.",
    "Disney animation inspired by Rudyard Kipling's \"Mowgli\" story. Mowgli is a boy who has been raised by wolves in the Indian jungle. When the wolves hear that the fierce tiger, Shere Khan, is nearby, they decide to send Mowgli to a local \"man-village\". On his way to the village, Mowgli meets many animal characters in this musical tale. When Shere Khan learns of Mowgli's presence, he tracks him down.",
    """The story of \"The Jungle Book" concerns a young man-cub named Mowgli. A panther named Bagheera one day comes across an abandoned boat, in which a small baby is seen. Taking pity on the baby, Bagheera takes it to a small family of wolves, who adopt the boy.

10 years pass, and Mowgli has grown into a wiry young boy, who has long since been adopted into his wolf pack, despite his differences. However, word has reached the pack that the tiger Shere Khan has been spotted in the jungle. The pack knows of Khan's hatred of man, and wish to send Mowgli away for protection. Bagheera volunteers to take Mowgli to a man-village some distance away.

Mowgli and Bagheera set out sometime after dark. They stay in a tree for the night, but are disturbed by Kaa, a python with a hypnotic gaze, who tries to hypnotize both of them, before being pushed out of the tree by Mowgli.

The next day, they are awakened by 'The Dawn Patrol,' a pack of elephants led by Colonel Hathi. Mowgli spends a few moments with their son, who one day dreams of following in his Father's footsteps. Bagheera orders Mowgli to continue on their way to the man-village, but Mowgli refuses. After some struggles, Bagheera and Mowgli separate, fed up with the other's company.

As Mowgli sulks by a rock, he is suddenly discovered by Baloo, a large bear with a care-free attitude. Bagheera hears the commotion caused by the two of them, and returns, dismayed that Mowgli has encountered the 'jungle bum.' Baloo's 'philosophy' of living care-free in the jungle easily takes hold of the young man-cub, and Mowgli now wishes to stay with Baloo. However, a group of monkeys suddenly appear, and take Mowgli away.

Mowgli is taken to some ancient ruins, lorded over by an orangutan named King Louie, who figures since Mowgli is a man-cub, he can help him learn how to make fire. Bagheera and Baloo show up shortly, and after a fierce chase, get Mowgli away from King Louie.

As Mowgli rests from the ordeal, Bagheera explains to Baloo why Mowgli must leave the jungle, and after telling Baloo of the danger that Shere Khan poses to him, Baloo reluctantly agrees to take Mowgli back, even though he had promised Mowgli he could stay in the jungle with him. When Mowgli finds out about this, he runs off again.

After some time going through the jungle, Mowgli encounters Kaa, who hypnotizes the boy. Kaa is just about to eat Mowgli, when he is alerted to Shere Khan. Kaa manages to carry on a conversation with the tiger, and just barely hides the fact that the man-cub is nearby. Once Shere Khan leaves, Kaa's plans to eat Mowgli are foiled when Mowgli comes out of his trance, and is able to escape.

Sometime afterward, Mowgli chances upon a group of vultures, who are willing to take him in as one of their own. However, before they can do so, Shere Khan appears. Even though he is feared by the vultures, Mowgli refuses to run. Just as it seems that Shere Khan may devour Mowgli, Baloo appears, and wrestles with the tiger, who ends up clawing at the large bear. In the ensuing chaos, Mowgli ties a flaming branch to Shere Khan's tail, and the fire spooks the tiger, sending him running away.

Just when it appears that Baloo has died, he recovers from the ordeal. Bagheera soon joins the group, and the three of them set off back through the jungle.

It seems that Bagheera's plan to get Mowgli to the man-village has failed, when a beautiful song wafts through their ears. As the three of them look through some bushes, they see the man-village, and by a small stream, a little girl appears, gathering water. This intrigues Mowgli, who tries to go for a closer look. Seeing the boy, the girl pretends to spill her water jug. Mowgli retrieves it, refills it, but instead of taking it, the girl leads him back to the man-village, humming her 'siren song.' Baloo whispers for Mowgli to come back, but the boy follows the girl into the village.

Bagheera happily explains that Mowgli is now where he belongs, and Baloo accepts this fact, before wrapping an arm around the panther, and the two of them return to the jungle."""
]
import pandas as pd

df = pd.DataFrame(data=junglebooksummaries, columns=["summary"])
df

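Normalise the text:

import re

def normalize_text(s, sep_token=" \n "):  # sep_token is unused; kept so existing calls still work
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+', ' ', s).strip()
    # drop stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,", "", s)
    # collapse doubled periods
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

df['summary'] = df["summary"].apply(normalize_text)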
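Note that the whitespace collapsing in this step also removes the blank lines between the paragraphs of the fourth summary, so it is embedded as one long string. If per-paragraph scoring is wanted instead, the split has to happen before normalising. A minimal sketch, assuming a hypothetical helper `split_paragraphs` (not part of my code above) and a made-up sample string:

```python
import re

def split_paragraphs(text):
    """Split on blank lines and return non-empty chunks with whitespace collapsed."""
    chunks = re.split(r"\n\s*\n", text)
    return [re.sub(r"\s+", " ", c).strip() for c in chunks if c.strip()]

# Hypothetical sample with the same blank-line layout as the long summary
sample = """First paragraph about wolves.

10 years pass, and Mowgli has grown.

Third paragraph."""

for i, chunk in enumerate(split_paragraphs(sample)):
    print(i, chunk)
```

Each chunk could then be embedded and scored on its own, which would show whether the "10 years pass" passage ranks differently when it is not diluted by the rest of the summary. I have not verified that this changes the ranking.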

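Tokenise it

from transformers import GPT2TokenizerFast

# GPT-2 token counts only approximate ada-002's tokenizer (cl100k_base); used here as a rough length filter
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
df['n_tokens'] = df["summary"].apply(lambda x: len(tokenizer.encode(x)))
# keep only summaries under 2000 tokens
df = df[df.n_tokens < 2000]
len(df)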

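Create embeddings

# assumes the pre-1.0 openai package, where these helpers live in openai.embeddings_utils
from openai.embeddings_utils import get_embedding, cosine_similarity

# engine is the id of my deployment
df['embeddings'] = df["summary"].apply(lambda x: get_embedding(x, engine='textembeddingadamc002'))
df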

Create an embedding of the query

query = "how old is mowgli"
embeddings_of_query = get_embedding(query, engine='textembeddingadamc002')

Find cosine similarity

df['cosine_similarity_with_query'] = df["embeddings"].apply(lambda x : cosine_similarity(embeddings_of_query, x))
sorted_df = df.sort_values("cosine_similarity_with_query", ascending=False)
sorted_df
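To make the ranking step concrete without calling the embedding API, here is a minimal sketch of the same score-and-sort logic on toy vectors. The 2-dimensional vectors, the labels and the helper `cosine_sim` are all hypothetical stand-ins for the real embeddings and for `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """dot(a, b) / (|a| * |b|), the same quantity the ranking above sorts by."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d "embeddings" for three summaries and one query
paragraph_vecs = {
    "summary 0": [0.9, 0.1],
    "summary 1": [0.5, 0.5],
    "summary 2": [0.1, 0.9],
}
query_vec = [1.0, 0.0]

# Score every summary against the query, then sort from most to least similar
ranked = sorted(
    ((cosine_sim(query_vec, vec), name) for name, vec in paragraph_vecs.items()),
    reverse=True,
)

for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

Printing every score, rather than only the top row, makes it easier to see how far apart the candidates actually are.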

But the result does not rank the paragraph beginning 'The story of "The Jungle Book" concerns a young ...' first, even though it is the only paragraph that mentions a time span ("10 years pass, and Mowgli has grown into a wiry young boy, ..."). Why?

Talha Tayyab
Manu Chadha

0 Answers