Setup
I have Anaconda virtual environment on a Windows machine. Torch, transformers, tensorflow and CUDA installed. I previously used GPU acceleration from the transformers pipeline.
What I want to do ultimately
I want to use BERT to take word embeddings of the text in my dataset, and input that in LDA to do topic modeling. The pseudo-code I intend to run:
import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel
# Load your dataset into a pandas dataframe
df = pd.read_csv("topic_modeling_input_dataset.csv")
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the reviews in the dataframe
df["tokenized_reviews"] = df["review"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
# Convert the tokenized reviews to tensors
input_ids = tf.constant(list(df["tokenized_reviews"]))
# Extract the word embeddings using the pre-trained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
_, word_embeddings = bert_model(input_ids)
# Convert the word embeddings from tensors to numpy arrays
word_embeddings = word_embeddings.numpy()
# Average the word embeddings for each review to obtain sentence embeddings
sentence_embeddings = np.mean(word_embeddings, axis=1)
# Use the sentence embeddings as input to Latent Dirichlet Allocation (LDA) for topic modeling
from sklearn.decomposition import LatentDirichletAllocation
# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=10)
# Fit the LDA model on the sentence embeddings
lda_model.fit(sentence_embeddings)
# Print the topics learned by the LDA model
for index, topic in enumerate(lda_model.components_):
print(f"Topic {index}:")
words = [tokenizer.convert_ids_to_tokens[i] for i in np.argsort(topic)[::-1][:10]]
print(words)
But can't get past through importing the libraries
Problem
The command
from transformers import BertTokenizer, TFBertModel
gives the error:
RuntimeError: Failed to import transformers.models.bert.modeling_tf_bert because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
[WinError 182] The operating system cannot run %1. Error loading "C:\Users\myuser\Anaconda3\envs\text_mining\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Debugging Attempt
In the directory, I only have caffe2_detectron_ops_gpu.dll
and no caffe2_detectron_ops.dll
, which was the problem in all reported cases I read online.
I also tried reinstalling caffe2 in conda, but can't get a clean command or way to do it. caffe2 documentation mentions that the commands could have unresolved bugs.