Use this tag for questions related to the datasets project from Hugging Face. Project on GitHub: https://github.com/huggingface/datasets
Questions tagged [huggingface-datasets]
221 questions
2
votes
0 answers
Unable to create tensor
I am trying to train an NLP model for an MLM problem, but the trainer.train function throws:
Unable to create tensor, you should probably activate truncation
and/or padding with 'padding=True' 'truncation=True' to have batched
tensors with the…

Harel Moshayof
- 33
- 5
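
The error quoted in the question above mentions padding and truncation. As a rough, library-free illustration of why those options matter, the sketch below turns a ragged batch of token-id lists into a rectangular one, which is what a tensor requires. It is not the tokenizer's actual implementation: `pad_and_truncate`, `pad_id`, and the token ids are hypothetical names and values chosen for the example.

```python
from typing import List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> List[List[int]]:
    """Return a rectangular batch: every row ends up with the same length."""
    # Truncation: cut every row down to at most max_length items.
    truncated = [row[:max_length] for row in batch]
    # Padding: extend shorter rows up to the longest remaining row.
    target = max((len(row) for row in truncated), default=0)
    return [row + [pad_id] * (target - len(row)) for row in truncated]


# Rows of different lengths cannot be stacked into a single tensor as-is.
ragged = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
rectangular = pad_and_truncate(ragged, max_length=5)
print(rectangular)
```

After this step every row has the same length, so the batch can be converted to a tensor without the shape mismatch that triggers the message.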
2
votes
1 answer
Is there a way to call Macro-Precision in Hugging Face Trainer?
I'm currently running tests on the DEFT-2015 dataset using Hugging Face models. I would like to compare my results with what has been done before.
I checked in the list_metrics method from the datasets library, but I did not see Macro Precision, which was the…

Eliott Thomas
- 23
- 4
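
For reference, macro precision as asked about above is the unweighted mean of the per-class precisions. The following is a minimal sketch in plain Python, not the metric shipped by any library; `macro_precision` is a hypothetical helper, and counting a class that is never predicted as precision 0 is a convention assumed here.

```python
from collections import Counter
from typing import Hashable, Sequence


def macro_precision(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class precision over all observed labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    # How often each label was predicted, and how often that was correct.
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    # Assumption: a class that is never predicted contributes 0.0.
    per_class = [
        correct[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    ]
    return sum(per_class) / len(per_class)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_precision(y_true, y_pred))
```

A function of this shape could be called from a custom metrics callback once the model's predictions have been converted to class labels.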
2
votes
0 answers
How to fine tune a masked language model?
I'm trying to follow the Hugging Face tutorial on fine-tuning a masked language model (masking a set of words randomly and predicting them). But they assume that the dataset is in their system (can load it with from datasets import load_dataset;…

Penguin
- 1,923
- 3
- 21
- 51
2
votes
1 answer
Calculate precision, recall, and F1 score for a custom dataset for multiclass classification with the Hugging Face library
I am trying to do multiclass classification for a sentence-pair task. I uploaded my custom train and test sets separately to Hugging Face Datasets, trained and tested my model, and was trying to see the F1 score and accuracy.
I…

Alex Kujur
- 121
- 6
2
votes
1 answer
HuggingFace Datasets to PyTorch
I want to load a dataset from Hugging Face and convert it to a PyTorch DataLoader. Here is my script:
dataset = load_dataset('cats_vs_dogs', split='train[:1000]')
trans = transforms.Compose([transforms.Resize((256,256)), transforms.PILToTensor()])
def…

Ahmad Anis
- 2,322
- 4
- 25
- 54
2
votes
2 answers
Implement a custom Hugging Face dataset with data downloaded from S3
In order to implement a custom Hugging Face dataset, I need to implement three methods:
from datasets import DatasetBuilder, DownloadManager
class MyDataset(DatasetBuilder):
def _info(self):
...
def _split_generator(self, dl_manager:…

Riccardo Bucco
- 13,980
- 4
- 22
- 50
2
votes
0 answers
How long does load_dataset take in Hugging Face?
I want to pre-train a T5 model using Hugging Face. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None
# Initialize a…

Ahmad
- 8,811
- 11
- 76
- 141
2
votes
1 answer
How to load a dataset in streaming mode on Google Colab?
I am trying to save disk space when using the CommonVoice French dataset (19 GB) on Google Colab, as my notebook always crashes after running out of disk space. I saw in the Hugging Face documentation that we can load a dataset in streaming mode so we can…

Mymozaaa
- 474
- 4
- 20
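
The streaming mode mentioned in the question above is requested in the library by passing `streaming=True` to `load_dataset`. The idea behind it can be illustrated without the library: records are produced lazily, one at a time, so nothing has to be written to disk or held in memory up front. The sketch below uses a plain Python generator; `stream_records` and its fake records are hypothetical stand-ins for a remote dataset, not the library's implementation.

```python
from itertools import islice
from typing import Dict, Iterator


def stream_records(n_records: int) -> Iterator[Dict[str, int]]:
    """Yield records one at a time instead of building the full list."""
    for i in range(n_records):
        # A real source would fetch the next record from a remote file here.
        yield {"id": i, "length": i % 7}


# Only the three requested records are ever created, even though the
# hypothetical source is far too large to materialize.
first_three = list(islice(stream_records(10**9), 3))
print(first_three)
```

The trade-off is that a lazily produced dataset can only be iterated in order, so random access by index is not available.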
1
vote
2 answers
How to load a Hugging Face dataset from a local path?
Take a simple example from this website, https://huggingface.co/datasets/Dahoas/rm-static:
If I want to load this dataset online, I simply use:
from datasets import load_dataset
dataset = load_dataset("Dahoas/rm-static")
What if I want to…

4daJKong
- 1,825
- 9
- 21
1
vote
1 answer
AttributeError: 'Dataset' object has no attribute 'remove_columns' in Hugging Face
I want to remove a column from the Billsum dataset from Hugging Face.
Error:
AttributeError: 'Dataset' object has no attribute 'remove_columns'
I can't find any solution to this problem, and it is giving me a headache.
Can someone help me?

jack wilson
- 25
- 3
1
vote
1 answer
How to download data from Hugging Face that is visible in the data viewer when the files are not available?
I can see them (Hugging Face dataset link: https://huggingface.co/datasets/EleutherAI/pile/):
but no matter how I change the download URL, I can't get the data. The files are not there and their script doesn't work.
Does anyone know how to get the splits and know…

Charlie Parker
- 5,884
- 57
- 198
- 323
1
vote
0 answers
DatasetGenerationError: An error occurred while generating the dataset
I'm trying to load my PubLayNet dataset from an S3 bucket into Databricks using Hugging Face datasets like this:
dataset_id = "/dbfs/mnt/ocr/dataset/publaynet"
dataset = load_dataset(dataset_id, data_files={"train":…

hima sai
- 95
- 1
- 11
1
vote
1 answer
Haystack: PromptNode takes too much time to load the model
I use the code below, based on the Haystack tutorials:
lfqa_prompt = PromptTemplate("deepset/question-answering-with-references", output_parser=AnswerParser(reference_pattern=r"Document\[(\d+)\]"))
prompt_node =…

user3164187
- 1,382
- 3
- 19
- 50
1
vote
0 answers
Why can T5 only generate sentences of length 20? Can someone help me? I would like to generate longer sentences.
from datasets import load_dataset
books = load_dataset('higashi1/mymulti30k', "en-de")
from transformers import AutoTokenizer
#checkpoint = "./logs/"
checkpoint = "t5-base"
tokenizer =…

HIKARI
- 11
- 1
1
vote
1 answer
CPU Out of memory when training a model with pytorch lightning
I am trying to train a BERT model on my data using the Trainer class from pytorch-lightning. However, I encountered a CPU out-of-memory exception.
Here is the code:
from transformers.data.data_collator import…

Ofir
- 590
- 9
- 19