
I want to create a search utility over PowerPoint decks that uses both the text within the slides and an image of each slide. Upon entering a search term, the application should return the k most relevant slide images. For this I have explored Pinecone's e-commerce hybrid search implementation, which uses sparse embeddings (a BM25 model) for text and dense embeddings (a CLIP model) for images -> https://docs.pinecone.io/docs/ecommerce-search
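For context, the query side of that hybrid approach looks roughly like this (a sketch based on my reading of that example, assuming the pinecone-client v2, pinecone-text, and sentence-transformers packages; the API key, environment, and index name are placeholders):

```python
# Sketch of a Pinecone hybrid query: sparse BM25 for keywords, dense CLIP
# for semantics. Placeholder credentials and index name, not a working setup.
import pinecone
from pinecone_text.sparse import BM25Encoder
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("slide-search")  # hypothetical index name

bm25 = BM25Encoder.default()             # BM25 params pretrained on MS MARCO
clip = SentenceTransformer("clip-ViT-B-32")

def hybrid_query(term: str, k: int = 5):
    # Sparse vector captures keyword overlap with the indexed slide text.
    sparse = bm25.encode_queries(term)
    # Dense CLIP vector captures semantic similarity to the slide images.
    dense = clip.encode(term).tolist()
    return index.query(
        vector=dense,
        sparse_vector=sparse,
        top_k=k,
        include_metadata=True,  # metadata can carry the slide/image reference
    )
```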

The problem is that it's tailored for e-commerce search, where the input text is much shorter than the text in PowerPoints. Also, PowerPoint text is pretty unstructured.

My current data preparation steps (a rough code sketch follows the list):

  1. Convert the PPTX file to PDF using win32com
  2. Convert each page of the PDF to a PNG using pypdfium2
  3. Extract the text from the PDF using pypdfium2
  4. Split the text into chunks of 77 words (as that's the max input length for the models Pinecone uses in its implementation)
  5. Create a dataframe where every chunk of text corresponds to the relevant slide image (PIL format)
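Here is the sketch of those five steps, assuming pywin32 and the pypdfium2 v4 API; the file paths, render scale, and chunk size are placeholders:

```python
# Rough sketch of the pipeline above; Windows only (PowerPoint COM automation).
import pandas as pd
import pypdfium2 as pdfium
import win32com.client

PPTX_PATH = r"C:\decks\example.pptx"  # COM needs absolute paths
PDF_PATH = r"C:\decks\example.pdf"

# Step 1: PPTX -> PDF via PowerPoint's COM interface.
powerpoint = win32com.client.Dispatch("PowerPoint.Application")
deck = powerpoint.Presentations.Open(PPTX_PATH, WithWindow=False)
deck.SaveAs(PDF_PATH, 32)  # 32 = ppSaveAsPDF
deck.Close()
powerpoint.Quit()

def chunk_words(text: str, size: int = 77) -> list[str]:
    # Step 4: naive fixed-size word chunks; 77 matches the limit cited above
    # (note CLIP's actual limit is 77 *tokens*, not words).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [""]

# Steps 2, 3, 5: render each page to a PIL image, extract its text,
# and pair every text chunk with its slide image in a dataframe.
pdf = pdfium.PdfDocument(PDF_PATH)
rows = []
for page_index in range(len(pdf)):
    page = pdf[page_index]
    image = page.render(scale=2.0).to_pil()      # slide image (PIL format)
    text = page.get_textpage().get_text_range()  # raw slide text
    for chunk in chunk_words(text):
        rows.append({"slide": page_index, "text_chunk": chunk, "image": image})

df = pd.DataFrame(rows)
```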

Could there be another way of doing this?

Danish Zahid Malik
  • Welcome to Stack Overflow! Asking for recommendations might not be appropriate on Stack Overflow (https://stackoverflow.com/help/how-to-ask), but it might be possible to ask the question on https://softwarerecs.stackexchange.com/ – alvas Aug 25 '23 at 16:21

0 Answers