I have a DataFrame with text that I want to tokenize using the Hugging Face transformers library. When I run the code, the "Tokenized_Text" column comes back empty. How can I fix this? The code is as follows:

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel

# Load the messages to tokenize
df = pd.read_csv('subject_messages.csv')

# Load the tokenizer for the Spanish BERT checkpoint
model_ckpt = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# This assignment is what produces the empty column
df["Tokenized_Text"] = tokenizer(df["Message"].to_list())
df.to_csv("tokenized_telegram_messages.csv", index=False)

At first I thought I was not initializing the tokenizer correctly, but the checkpoint I am using is trained specifically for Spanish. I expected the code to produce a column containing the tokenized text.

  • IIRC, you want to apply the tokenizer to each cell: `df['Tokenized_Text'] = df['Message'].apply(tokenizer)` – Quang Hoang Apr 27 '23 at 18:12
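
Building on the comment above: calling a Hugging Face tokenizer on a list of strings returns one dict-like `BatchEncoding` (with keys such as `input_ids` and `attention_mask`), not one value per row, which is probably why assigning it directly to a column does not give the expected result. A minimal sketch of one way around this is to pick a single field and assign it row by row. The helper name `add_tokenized_columns` and the extra `Input_Ids` column are hypothetical, introduced only for illustration; the helper assumes the tokenizer is callable on a list of strings and exposes `convert_ids_to_tokens`, as Hugging Face tokenizers do.

```python
import pandas as pd


def add_tokenized_columns(df, tokenizer, text_col="Message"):
    """Return a copy of df with per-row token ids and token strings.

    Hypothetical helper: `tokenizer` is assumed to behave like a Hugging Face
    tokenizer, i.e. calling it on a list of strings returns a mapping whose
    "input_ids" entry holds one list of ids per input string.
    """
    # Tokenizers expect strings, so replace missing values before converting.
    texts = df[text_col].fillna("").astype(str).tolist()

    # One batched call; the result is dict-like, not a column-shaped list.
    encoded = tokenizer(texts, truncation=True)
    ids_per_row = [list(ids) for ids in encoded["input_ids"]]

    out = df.copy()
    # Wrap the lists in a Series aligned to the index: one list per row.
    out["Input_Ids"] = pd.Series(ids_per_row, index=out.index)
    out["Tokenized_Text"] = pd.Series(
        [tokenizer.convert_ids_to_tokens(ids) for ids in ids_per_row],
        index=out.index,
    )
    return out
```

With the tokenizer from the question, this would be used as `df = add_tokenized_columns(df, tokenizer)` before the `df.to_csv(...)` call. Note that `to_csv` writes each list as its string representation, so the lists need to be parsed again if the file is read back.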
