Questions tagged [huggingface-transformers]

Transformers is a Python library that implements various transformer NLP models in PyTorch and TensorFlow.

transformers is a natural language processing (NLP) library that implements many state-of-the-art transformer models in Python using PyTorch and TensorFlow. It was created and is maintained by Hugging Face. The library is available through package managers and is open-sourced on GitHub. It was formerly known as pytorch-transformers and, before that, as pytorch-pretrained-bert.

2878 questions
64 votes • 6 answers

Where does hugging face's transformers save models?

Running the code below downloads a model. Does anyone know what folder it downloads it to?
!pip install -q transformers
from transformers import pipeline
model = pipeline('fill-mask')
user3472360 • 1,337
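By default the files land in a cache directory under the user's home folder (typically ~/.cache/huggingface). A minimal sketch for printing the location, assuming a transformers 4.x install where the constant is exposed under transformers.utils:
import os
from transformers.utils import TRANSFORMERS_CACHE

print(TRANSFORMERS_CACHE)              # resolved cache directory for downloaded models
print(os.listdir(TRANSFORMERS_CACHE))  # the cached files/snapshots live here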
59 votes • 4 answers

How to change huggingface transformers default cache directory

The default cache directory lacks disk capacity, so I need to change where the cache directory is configured.
Ivan Lee • 3,420
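Two common ways to redirect the cache, sketched under the assumption of a transformers 4.x setup: set an environment variable before transformers is imported, or pass cache_dir per call (the /mnt/big_disk path is only illustrative):
import os
os.environ["TRANSFORMERS_CACHE"] = "/mnt/big_disk/hf_cache"  # must be set before importing transformers

from transformers import AutoModel

# alternatively, override the location for a single download
model = AutoModel.from_pretrained("bert-base-uncased", cache_dir="/mnt/big_disk/hf_cache")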
52 votes • 6 answers

Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax: - a path to a `directory` containing vocabulary files required by the…
Mittenchops • 18,633
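A minimal sketch of the save-once, load-locally pattern (the ./local-bert directory name is illustrative):
from transformers import AutoModel, AutoTokenizer

# first run: download from the hub and write everything to a local directory
AutoModel.from_pretrained("bert-base-uncased").save_pretrained("./local-bert")
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained("./local-bert")

# later runs: point from_pretrained at that directory instead of a model id
model = AutoModel.from_pretrained("./local-bert")
tokenizer = AutoTokenizer.from_pretrained("./local-bert")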
41 votes • 5 answers

How to disable TOKENIZERS_PARALLELISM=(true | false) warning?

I use PyTorch to train a huggingface-transformers model, but every epoch it outputs the warning: The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set…
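The warning comes from the Rust tokenizers library when a process that already used a fast tokenizer is forked (e.g. by DataLoader workers). A minimal sketch of silencing it by setting the variable the message mentions, before the fork happens:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # set early, before tokenizers and DataLoader workers start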
38 votes • 5 answers

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts,…
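Fast tokenizers raise this error when the input list contains anything that is not a plain string (for example NaN from pandas, or None). A hedged sketch of the usual cleanup step, assuming a 'text' column as in the question:
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

df = pd.read_csv("data.csv")              # illustrative path
df = df.dropna(subset=["text"])           # NaN entries trigger the TextEncodeInput error
texts = df["text"].astype(str).to_list()  # make sure every item is a str

encodings = tokenizer(texts, truncation=True, padding=True)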
37 votes • 2 answers

What's the difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face?

Here is an example of doing sequence classification using a model to determine if two sequences are paraphrases of each other. The two examples give two different results. Can you help me explain why tokenizer.encode and tokenizer.encode_plus give…
andy • 1,951
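In short, encode returns only the list of token ids, while encode_plus returns a dict that also contains token_type_ids and attention_mask; for a sentence pair those extra fields matter to models like BERT. A small illustration:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

ids = tokenizer.encode("The cat sat.", "A cat was sitting.")
# -> flat list of input ids for the pair, with [CLS]/[SEP] added

enc = tokenizer.encode_plus("The cat sat.", "A cat was sitting.")
# -> {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}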
31 votes • 4 answers

Transformers v4.x: Convert slow tokenizer to fast tokenizer

I'm following the transformers pretrained-model example for xlm-roberta-large-xnli:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
and I get the…
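The conversion error for this checkpoint usually means the sentencepiece/protobuf dependencies needed to build the fast tokenizer are missing; installing them, or explicitly falling back to the slow tokenizer, are two common workarounds (a sketch, not an accepted answer):
# pip install sentencepiece protobuf   # lets transformers convert the slow tokenizer

from transformers import pipeline, AutoTokenizer

# or avoid the conversion entirely by requesting the slow tokenizer
tokenizer = AutoTokenizer.from_pretrained("joeddav/xlm-roberta-large-xnli", use_fast=False)
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli",
                      tokenizer=tokenizer)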
31 votes • 1 answer

How to use 'collate_fn' with dataloaders?

I am trying to train a pretrained roberta model using 3 inputs, 3 input_masks and a label as tensors of my training dataset. I do this using the following code:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler,…
Sam V • 479
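A self-contained sketch of a custom collate_fn that turns a list of (input_ids, attention_mask, label) tuples into batched tensors; the dummy data only stands in for the question's dataset:
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy tensors standing in for the real inputs, masks and labels
input_ids = torch.randint(0, 1000, (8, 16))
attention_mask = torch.ones(8, 16, dtype=torch.long)
labels = torch.randint(0, 2, (8,))
dataset = TensorDataset(input_ids, attention_mask, labels)

def collate_fn(batch):
    # `batch` is a list of (input_ids, attention_mask, label) tuples, one per sample
    ids = torch.stack([item[0] for item in batch])
    masks = torch.stack([item[1] for item in batch])
    labs = torch.stack([item[2] for item in batch])
    return {"input_ids": ids, "attention_mask": masks, "labels": labs}

loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
batch = next(iter(loader))  # dict of batched tensors ready for a model's forward()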
28 votes • 7 answers

How to download model from huggingface?

https://huggingface.co/models For example, I want to download 'bert-base-uncased', but can't find a 'Download' link. Please help. Or is it not downloadable?
marlon • 6,029
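Besides grabbing individual files from the model page, the weights can be fetched programmatically. Two hedged options, assuming the huggingface_hub package is installed for the first one:
# option 1: download a full snapshot of the repository
from huggingface_hub import snapshot_download
local_dir = snapshot_download("bert-base-uncased")
print(local_dir)  # folder containing config, vocab and weights

# option 2: let transformers download, then save a local copy
from transformers import AutoModel, AutoTokenizer
AutoModel.from_pretrained("bert-base-uncased").save_pretrained("./bert-base-uncased")
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained("./bert-base-uncased")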
27 votes • 3 answers

How to build semantic search for a given domain

There is a problem we are trying to solve where we want to do semantic search on our set of data, i.e. we have domain-specific data (example: sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a…
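One common recipe is to embed the corpus sentences once and compare a query embedding against them with cosine similarity. A minimal sketch using the sentence-transformers package (an assumption; any encoder that produces sentence vectors would do):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The engine overheats at high RPM.",
    "Brake pads wear out quickly on this model.",
    "The infotainment screen freezes sometimes.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("engine temperature problem", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)  # indices + cosine scores
print(hits)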
26 votes • 5 answers

How to compare sentence similarities using embeddings from BERT

I am using the HuggingFace Transformers package to access pretrained models. As my use case needs functionality for both English and Arabic, I am using the bert-base-multilingual-cased pretrained model. I need to be able to compare the similarity of…
KOB • 4,084
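A minimal sketch that mean-pools the last hidden states from bert-base-multilingual-cased into sentence vectors and compares them with cosine similarity; note that vanilla BERT was not trained for sentence similarity, so this is only a rough baseline:
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state.mean(dim=1).squeeze(0)  # mean-pooled sentence vector

a = embed("I like fast cars.")
b = embed("أحب السيارات السريعة")
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())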
25 votes • 3 answers

Huggingface saving tokenizer

I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer =…
sachinruk • 9,571
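A minimal sketch of saving the tokenizer to a directory that is baked into (or mounted in) the container, then loading it without any network access; the ./offline-tokenizer path is illustrative:
from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"

# on a machine with internet access
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./offline-tokenizer")

# inside the offline container, load from the directory instead of the hub
tokenizer = AutoTokenizer.from_pretrained("./offline-tokenizer")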
24 votes • 2 answers

Saving and reloading a huggingface fine-tuned transformer

I am trying to reload a fine-tuned DistilBertForTokenClassification model. I am using transformers 3.4.0 and pytorch version 1.6.0+cu101. After using the Trainer to train the downloaded model, I save the model with trainer.save_model() and in my…
Nate • 241
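The usual pattern, sketched under the assumption that trainer and tokenizer already exist from the fine-tuning run: trainer.save_model() writes the model and config to a directory, the tokenizer is saved alongside it, and both are reloaded with from_pretrained:
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

save_dir = "./finetuned-distilbert-ner"  # illustrative path

# after training (trainer and tokenizer come from the fine-tuning script):
# trainer.save_model(save_dir)
# tokenizer.save_pretrained(save_dir)

# in the new process, reload both from that directory
model = DistilBertForTokenClassification.from_pretrained(save_dir)
tokenizer = DistilBertTokenizerFast.from_pretrained(save_dir)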
23 votes • 3 answers

Add dense layer on top of Huggingface BERT model

I want to add a dense layer on top of the bare BERT Model transformer outputting raw hidden-states, and then fine-tune the resulting model. Specifically, I am using this base model. This is what the model should do: Encode the sentence (a vector…
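A minimal sketch of wrapping the bare model in an nn.Module with a dense head on the [CLS] representation; the base checkpoint and output dimension are assumptions:
import torch.nn as nn
from transformers import AutoModel

class BertWithDenseHead(nn.Module):
    def __init__(self, base="bert-base-uncased", out_dim=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)  # bare model, raw hidden states
        self.dense = nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]   # hidden state of the [CLS] token
        return self.dense(cls_repr)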
22 votes • 2 answers

How to free GPU memory in PyTorch

I have a list of sentences I'm trying to calculate perplexity for, using several models, with this code:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model =…
Penguin • 1,923
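When looping over several models, the memory of the previous one is only released after the last Python reference to it is gone. A minimal sketch of the usual cleanup between models (assuming a CUDA device is available):
import gc
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("cointegrated/rubert-tiny").cuda()
# ... compute perplexities with this model ...

del model                 # drop the reference so the tensors can be freed
gc.collect()              # collect any lingering Python objects
torch.cuda.empty_cache()  # return cached blocks so the memory shows up as free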