
I am trying to predict sentiment for 20 million records using this model from Hugging Face:

https://huggingface.co/finiteautomata/beto-sentiment-analysis

This model takes 1 hour and 20 minutes to predict 70,000 records.

The model is saved locally and is loaded from that local copy.

Can anyone please suggest how I can use it efficiently to predict 20 million records in the minimum time?
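One thing I could try is batching the pipeline calls on a GPU instead of predicting record by record. A minimal sketch of that idea (the GPU device index and the batch size below are only guesses, not tuned values, and df here is a tiny stand-in for the real data):

import pandas as pd
from transformers import pipeline

model_path = 'path where model is saved'  # local copy of finiteautomata/beto-sentiment-analysis

# device=0 selects the first GPU; use device=-1 to stay on the CPU
sentiment = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path, device=0)

# Tiny stand-in for the real DataFrame with a 'Message' column
df = pd.DataFrame({'Message': ['Me encanta este producto', 'Esto es terrible']})

# One call over the whole column, batched internally, instead of one call per row
results = sentiment(df['Message'].tolist(), batch_size=64, truncation=True)
df['Sentiment'] = [r['label'] for r in results]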

Also, I am using a zero-shot classification model on the same data, and it is taking 7 minutes to predict 1,000 records.

Please suggest a way to reduce the prediction time for this as well.

from transformers import pipeline

model_path = 'path where model is saved'  # the sentiment model above is loaded from this local path

classifier = pipeline("zero-shot-classification",
                      model="Recognai/bert-base-spanish-wwm-cased-xnli")

def predict(row):
    topics = [...]  # five candidate labels here
    res = classifier(row, topics)
    return res

# df is a pandas DataFrame with a 'Message' column; this df contains 70k records
df['Predict'] = df['Message'].apply(predict)
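The apply call above makes one pipeline call per row, which is the slowest way to drive the pipeline. Below is a sketch of the same model called once over the whole column with internal batching (the GPU index, the batch size, and the placeholder labels are assumptions to be replaced with the real values):

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="Recognai/bert-base-spanish-wwm-cased-xnli",
                      device=0)                                  # GPU index; -1 for CPU

# Placeholders standing in for the real five candidate labels
topics = ['label_1', 'label_2', 'label_3', 'label_4', 'label_5']

messages = df['Message'].tolist()                                # the 70k messages

# One batched call instead of 70k separate pipeline calls
results = classifier(messages, topics, batch_size=32)
df['Predict'] = [r['labels'][0] for r in results]                # keep the top-scoring label

Note that the zero-shot pipeline runs one NLI forward pass per candidate label, so five labels cost roughly five times as much per record as a plain sentiment prediction; batching helps, but it stays much heavier than the sentiment model.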
Kumar
  • To do this in minimum time, you'd have to parallelize the task and use a cluster of powerful multicore machines. With your current setup it's going to take approximately 20 days (around 100 days for the 2nd model). Don't expect miracles: to run a complex model fast on some huge data you need costly hardware. – Erwan May 10 '22 at 17:20
  • Thanks, Erwan, for your suggestion. Could you please confirm whether there is any other similar pre-trained model that can achieve this task more quickly? – Kumar May 10 '22 at 18:50
  • Sure, there are many to choose from. Popular ones: [nltk](https://realpython.com/python-nltk-sentiment-analysis/), [Spacy/textblob](https://spacy.io/universe/project/spacy-textblob). – Erwan May 10 '22 at 19:59
  • Thanks for the information. I am looking for something that can do zero-shot classification; for sentiment we have many models. – Kumar May 10 '22 at 23:23
  • Imho, as long as you're using complex DL models, it's going to require a lot of time and/or computing power for your data. The libraries I mentioned use simpler methods; they're possibly less accurate but faster. – Erwan May 11 '22 at 13:01
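As an illustration of the lighter-weight route mentioned in the last comment, a lexicon-based scorer such as TextBlob can tag millions of records quickly on a CPU; note that its default polarity lexicon is geared to English, so Spanish text would likely need translation or a Spanish-specific lexicon first. A minimal sketch of that idea (the tiny df is only a stand-in for the real data):

import pandas as pd
from textblob import TextBlob

# Tiny stand-in for the real DataFrame
df = pd.DataFrame({'Message': ['great service', 'terrible experience']})

# Lexicon-based polarity in [-1, 1]; far cheaper per record than a transformer
df['Polarity'] = df['Message'].apply(lambda text: TextBlob(text).sentiment.polarity)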

0 Answers