I have written a script that takes embeddings, IDs, and metadata from a data table and upserts them to a Pinecone index. I have confirmed multiple times that my data structure is correct: I am passing a list of tuples, and my embedding vectors are lists, not arrays. Yet no matter what I try, I get errors when trying to upsert the data.

Actual error: pinecone_package.indexing.index - ERROR - Error upserting vectors to index: tunl-vision, Error: object of type 'NoneType' has no len()

My function for processing data from a DataFrame:

def process_batch(df, model, processor, tokenizer, indexer, s3_client, bucket_name):
    """
    Process a batch of images: generate embeddings, upsert to Pinecone, and upload to S3.
    """
    try:
        # Check if image URLs are valid
        df['is_valid'] = df['image'].apply(check_valid_urls)
        valid_df = df[df['is_valid']]

        # Get embeddings
        valid_df['image_embeddings'] = valid_df['image'].apply(lambda url: get_single_image_embedding(get_image(url), processor, model))
        valid_df['text_embeddings'] = valid_df['description'].apply(lambda text: get_single_text_embedding(text, tokenizer, model))

        # Convert embeddings to lists
        for col in ['image_embeddings', 'text_embeddings']:
            valid_df[col] = valid_df[col].apply(lambda x: x[0].tolist() if isinstance(x, np.ndarray) and x.ndim > 1 else x.tolist())

        # Upsert to Pinecone
        item_ids = valid_df['id'].tolist()
        vectors = valid_df['image_embeddings'].tolist()
        metadata = valid_df.drop(columns=['id', 'is_valid', 'image_embeddings', 'text_embeddings', 'size']).to_dict(orient='records')

        data_to_upsert = list(zip(item_ids, vectors, metadata))
        indexer.upsert_vectors(data_to_upsert)

        # Preprocess images and upload to S3
        for url in valid_df['image']:
            preprocess_and_upload_image(s3_client, bucket_name, url)

        logging.info("Successfully processed batch.")
    except Exception as e:
        logging.error(f"Error processing batch: {str(e)}")

My actual upsert function (part of a class that initializes Pinecone when it is called):

    def upsert_vectors(self, data: List[Tuple[str, List[float], Dict]]) -> None:
        """
        Upsert vectors to the Pinecone index.

        Parameters
        ----------
        data : List[Tuple[str, List[float], Dict]]
            List of tuples, each containing an item ID, a vector, and a dictionary of metadata.

        Raises
        ------
        Exception
            If there is an error in upserting the vectors.
        """
        try:
            # Print the first 5 data points
            self.logger.info(f'First 5 data points: {data[:5]}')

            # Check if data is a list of tuples
            if not all(isinstance(i, tuple) and len(i) == 3 for i in data):
                self.logger.error(f'Data is not in correct format: {data}')
                return

            # Check if all IDs, vectors, and metadata are non-empty
            for item_id, vector, meta in data:
                if not item_id or not vector or not meta:
                    self.logger.error(f'Found empty or None data: ID={item_id}, Vector={vector}, Meta={meta}')
                    return

            upsert_result = self.index.upsert(vectors=data)
            self.logger.info(
                f'Successfully upserted {len(upsert_result.upserted_ids)} vectors to index: {self.index_name}')
        except Exception as e:
            self.logger.error(f'Error upserting vectors to index: {self.index_name}, Error: {str(e)}')
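
One way to narrow this down, offered as a debugging sketch rather than a confirmed fix: the except block above logs only str(e), which hides where the failing len() call lives. Inside the try block, the only explicit len() on a value that could be missing is len(upsert_result.upserted_ids), so the error may come from the success log line rather than from the upsert itself. That is an assumption to verify, since the Pinecone documentation describes the upsert response in terms of an upserted_count field. Logging with logger.exception instead of logger.error prints the full traceback and shows the exact line. The names failing_step and run_with_traceback are hypothetical stand-ins, and no Pinecone call is made here.

```python
import io
import logging


def failing_step(result):
    # Hypothetical stand-in for any step that calls len() on a value
    # that may turn out to be None.
    return len(result)


def run_with_traceback(logger: logging.Logger) -> None:
    try:
        failing_step(None)
    except Exception:
        # logger.exception logs at ERROR level and appends the full
        # traceback, which names the function and line that raised.
        logger.exception("Error upserting vectors")


# Capture the log output in memory so it can be inspected.
log_stream = io.StringIO()
demo_logger = logging.getLogger("upsert_demo")
demo_logger.setLevel(logging.ERROR)
demo_logger.addHandler(logging.StreamHandler(log_stream))
demo_logger.propagate = False

run_with_traceback(demo_logger)
log_output = log_stream.getvalue()
print(log_output)
```

In the real class, the equivalent change would be replacing self.logger.error(...) in the except block with self.logger.exception(...), leaving everything else as it is.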

A snippet of the data logged just before the upsert confirms that my structure is correct:

[('62a0be4d5ce2f83f3931a452-00664382372027', [0.2172567993402481, 0.05793587118387222, 0.1606423407793045, 0.3030063211917877,...], {'image': 'https://athleta.gap.com/webcontent/0014/560/754/cn14560754.jpg', 'merchant': '62a0b5535ce2f83f392ec994', 'brand': ''...})
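
The checks in upsert_vectors only test whether each ID, vector, and metadata dict is empty as a whole; they do not look inside the metadata, and the snippet shows only the first record. A minimal sketch of a deeper check follows, assuming that None or NaN values nested in a vector or in a metadata dict are worth ruling out (to my knowledge Pinecone does not accept null metadata values, but treat that as an assumption to confirm). The function name find_problem_records and the sample records are hypothetical.

```python
import math
from typing import Any, Dict, List, Tuple

Record = Tuple[str, List[float], Dict[str, Any]]


def _is_missing(value: Any) -> bool:
    # True for None and for float NaN (what pandas uses for missing cells).
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_problem_records(data: List[Record]) -> List[str]:
    """Return descriptions of records that look suspicious."""
    problems = []
    for position, record in enumerate(data):
        if not isinstance(record, tuple) or len(record) != 3:
            problems.append(f"record {position}: not a 3-tuple")
            continue
        item_id, vector, meta = record
        if not isinstance(item_id, str) or not item_id:
            problems.append(f"record {position}: missing or non-string id")
        if not isinstance(vector, list) or not vector:
            problems.append(f"record {position}: vector is not a non-empty list")
        elif any(_is_missing(v) for v in vector):
            problems.append(f"record {position}: vector contains None or NaN")
        if not isinstance(meta, dict):
            problems.append(f"record {position}: metadata is not a dict")
        else:
            for key, value in meta.items():
                if _is_missing(value):
                    problems.append(
                        f"record {position}: metadata[{key!r}] is None or NaN"
                    )
    return problems


# Hypothetical sample data: one clean record and two problematic ones.
sample = [
    ("id-1", [0.1, 0.2], {"brand": "", "merchant": "m-1"}),
    ("id-2", [0.3, 0.4], {"brand": None}),
    ("id-3", None, {"brand": "x"}),
]
problems = find_problem_records(sample)
for line in problems:
    print(line)
```

Running this over the full data_to_upsert list, rather than the first few entries, would show whether any record further down the batch differs from the one printed above.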

  • It looks like you are passing in an empty variable, but it is hard to verify with the code given. If you can reduce it to a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), it will be easier to identify... – D.L Jul 28 '23 at 09:29
  • How do you create the argument `indexer` in `process_batch`? – Yilmaz Aug 03 '23 at 04:24
