
Setup

I have an Anaconda virtual environment on a Windows machine, with torch, transformers, tensorflow and CUDA installed. I have previously used GPU acceleration through the transformers pipeline in this setup.

What I want to do ultimately

I want to use BERT to compute word embeddings for the text in my dataset and feed them into LDA for topic modeling. The pseudo-code I intend to run:

import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel

# Load your dataset into a pandas dataframe
df = pd.read_csv("topic_modeling_input_dataset.csv")

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the reviews, padding and truncating them to a common length
# so that they form a rectangular batch of tensors
encoded = tokenizer(list(df["review"]), padding=True, truncation=True, return_tensors="tf")
input_ids = encoded["input_ids"]

# Extract the word embeddings using the pre-trained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
outputs = bert_model(input_ids=input_ids, attention_mask=encoded["attention_mask"])
word_embeddings = outputs.last_hidden_state  # shape: (batch, tokens, hidden)

# Convert the word embeddings from tensors to numpy arrays
word_embeddings = word_embeddings.numpy()

# Average the word embeddings for each review to obtain sentence embeddings
sentence_embeddings = np.mean(word_embeddings, axis=1)

# Use the sentence embeddings as input to Latent Dirichlet Allocation (LDA) for topic modeling
from sklearn.decomposition import LatentDirichletAllocation

# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=10)

# Fit the LDA model on the sentence embeddings
lda_model.fit(sentence_embeddings)

# Print the topics learned by the LDA model
for index, topic in enumerate(lda_model.components_):
    print(f"Topic {index}:")
    words = [tokenizer.convert_ids_to_tokens(int(i)) for i in np.argsort(topic)[::-1][:10]]
    print(words)

But I can't get past importing the libraries.
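
A side note on the averaging step in the pseudo-code: if the reviews are padded to a common length so they can be batched, a plain np.mean over the token axis also averages the padding positions. Below is a minimal sketch of a masked average, assuming an attention mask with 1 for real tokens and 0 for padding. The function name masked_mean is hypothetical and not part of any library.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, tokens, hidden)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = np.asarray(attention_mask, dtype=np.float64)[..., np.newaxis]
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token counts to avoid dividing by zero for all-padding rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts
```

With the tokenizer output, this would be called on the model's last hidden state and the attention mask, both converted to numpy arrays.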

Problem

The statement from transformers import BertTokenizer, TFBertModel raises the following error:

RuntimeError: Failed to import transformers.models.bert.modeling_tf_bert because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
[WinError 182] The operating system cannot run %1. Error loading "C:\Users\myuser\Anaconda3\envs\text_mining\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

Debugging Attempt

In the torch\lib directory, I only have caffe2_detectron_ops_gpu.dll and no caffe2_detectron_ops.dll, which is the file involved in all the reported cases I found online.
I also tried reinstalling caffe2 through conda, but couldn't find a clean command or procedure for it. The caffe2 documentation mentions that its install commands may have unresolved bugs.
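
Since the DLL named in the error lives under torch\lib, the traceback suggests the failure comes from loading torch rather than from transformers or tensorflow themselves. One way to check this is to import each package on its own and see which one fails. This is a rough sketch; the helper names try_import and diagnose are hypothetical.

```python
import importlib

def try_import(module_name):
    """Try to import a module and return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
        return True, None
    except Exception as exc:
        # DLL load failures typically surface as OSError, missing packages as ImportError
        return False, f"{type(exc).__name__}: {exc}"

def diagnose(module_names):
    """Import each module separately and collect the result per module."""
    return {name: try_import(name) for name in module_names}
```

Running diagnose(["torch", "tensorflow", "transformers"]) inside the text_mining environment and printing the result should show whether torch alone reproduces the WinError 182.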
