DataFrame text tokenization with Hugging Face is not working

Asked Apr 27 '23 at 17:42

Active Apr 27 '23 at 17:42

Viewed 72 times

I have a DataFrame with a text column that I want to tokenize using the Hugging Face transformers library. When I run the code, the "Tokenized_Text" column comes back empty. How can this be solved? The code is as follows:

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel

df = pd.read_csv('subject_messages.csv')

model_ckpt = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
df["Tokenized_Text"] = tokenizer(df["Message"].to_list())
df.to_csv("tokenized_telegram_messages.csv", index=False)

At first I thought I was initializing the tokenizer incorrectly, but the checkpoint is trained specifically for Spanish, so that should not be the cause. I expect the code to produce a column containing the tokenized text.

asked Apr 27 '23 at 17:42

Mark Davidson


IIRC, you want to apply the tokenizer to each cell: `df['Tokenized_Text'] = df['Message'].apply(tokenizer)` – Quang Hoang Apr 27 '23 at 18:12
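A possible explanation, offered as an assumption rather than a confirmed diagnosis: calling the tokenizer on a list returns one dict-like batch object (with keys such as `input_ids` and `attention_mask`), not one value per row. When a dict-like object is assigned to a column, pandas appears to align its keys against the DataFrame index, and since none of them match, every cell ends up as NaN. The sketch below pulls the per-row `input_ids` lists out of the batch and assigns those instead. The helper name `add_tokenized_column` and its parameters are hypothetical, and it accepts any callable that returns a mapping with an `input_ids` entry.

```python
import pandas as pd


def add_tokenized_column(df, tokenizer, text_col="Message", out_col="Tokenized_Text"):
    """Return a copy of df with the token ids of each row stored in out_col.

    Hypothetical helper: `tokenizer` is any callable that takes a list of
    strings and returns a mapping with an "input_ids" entry holding one
    list of ids per input string (Hugging Face tokenizers behave this way).
    """
    # Replace missing values so the tokenizer only ever receives strings.
    texts = df[text_col].fillna("").astype(str).tolist()

    # One call for the whole batch; the result is a single dict-like object.
    encoded = tokenizer(texts)

    out = df.copy()
    # Assign one list of ids per row, keeping the original index.
    out[out_col] = pd.Series(list(encoded["input_ids"]), index=out.index)
    return out
```

With the question's setup this would be called as `df = add_tokenized_column(df, tokenizer)`. If token strings are wanted rather than ids, `df["Message"].apply(tokenizer.tokenize)` is another option worth trying.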
