
I am totally new to all of this. I have just started using Hugging Face, and I am trying to use the DistilBERT model. I was following along with a textbook that shows how to tokenize text and then run it through the DistilBERT model. The dataset they used was one of the datasets on the Hugging Face Hub, and I was able to replicate what I saw just fine with their dataset.

Now I am trying to use my own dataset, and I receive the error

TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

but if I add `is_split_into_words=True`, the error message changes to

PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]

I've spent the last several days trying to troubleshoot this error, including looking at others who have gotten it on this site (none of their cases looked similar to mine) and combing through the guides and courses on Hugging Face. None of it has been helpful. I'm using Jupyter notebooks in Google Colab. Below is my code:

def tokenize(batch):
  return tokenizer(batch["content"], truncation=True, padding=True, is_split_into_words=True, return_tensors="pt")

print(tokenize(reviews["train"][:2]))

reviews_encoded = reviews.map(tokenize, batched=True, batch_size=None)

Thank you so much, any help is greatly appreciated.

  • What's the type of your `batch["content"]`? When you set `is_split_into_words` to True, you need to provide a list of tokens (words) for each input example. I'm not sure you're doing that before sending your dataset to the tokenizer. – inverted_index Aug 28 '22 at 14:41
  • Thanks for the response. Originally I didn't put `is_split_into_words` in my code at all; I received the error and thought I could fix it, since that parameter seemed related to the error message. The batch elements are strings. It's 1.4 million Google reviews I scraped earlier in the year. – Bête Noire Aug 28 '22 at 16:17
  • Got it. I believe `batch["content"]` is a list whose elements are strings. If you want to tokenize that input, you don't need to set `is_split_into_words=True`. However, if you do set it to True, then you need to make `batch["content"]` a list of lists, where each inner list contains the words/tokens that you have already split some other way (see the sketch after these comments). – inverted_index Aug 28 '22 at 16:46
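For reference, here is a minimal sketch contrasting the two input shapes a Hugging Face fast tokenizer accepts; the checkpoint name is an assumption for illustration, not taken from the question:

from transformers import AutoTokenizer

# Checkpoint assumed for illustration; any fast tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Default mode: the batch is a list of plain strings.
raw = ["great coffee, friendly staff", "too crowded on weekends"]
print(tokenizer(raw, truncation=True, padding=True))

# is_split_into_words=True: the batch is a list of lists of words,
# i.e. each example has already been split into tokens beforehand.
pre_split = [["great", "coffee"], ["too", "crowded"]]
print(tokenizer(pre_split, truncation=True, padding=True, is_split_into_words=True))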

0 Answers