Questions tagged [huggingface-datasets]
Use this tag for questions related to the datasets project from Hugging Face. Project on GitHub: https://github.com/huggingface/datasets
221 questions
3 votes · 1 answer
How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?
I've looked extensively over the internet, Hugging Face's (HF's) discussion forum & repo, but found no end-to-end example of how to properly do DDP/distributed data parallel with HF (links at the end).
This is what I need to be capable of running it end…

Charlie Parker · 5,884 · 57 · 198 · 323
3 votes · 1 answer
How to load two pandas dataframes into Hugging Face's Dataset object?
I am trying to load the train and test data frames into a Dataset object. The usual way to load a pandas dataframe into a Dataset object is:
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
dataset =…

Aaditya Ura · 12,007 · 7 · 50 · 88
3 votes · 2 answers
How to train Wav2Vec2 XLSR with a local custom dataset
I want to train a speech-to-text model with Wav2Vec2 XLSR (a transformer-based model) for the Danish language. As a recommendation, many people train their model on Common Voice with the help of the datasets library, but in Common Voice there is very…

Siyam Fahad · 61 · 7
3 votes · 0 answers
Create a custom data_collator for the Hugging Face Trainer
I need to create a custom data_collator for fine-tuning with the Hugging Face Trainer API.
Hugging Face offers DataCollatorForWholeWordMask for masking whole words within sentences with a given probability.
model_ckpt =…

kkgarg · 1,246 · 1 · 12 · 28
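Since the Trainer accepts any callable as data_collator, one way to prototype is a plain function that batches and pads a list of feature dicts. This sketch uses Python lists for clarity (a real collator would wrap the padded lists in torch tensors), and the field names are illustrative:

```python
# A minimal custom collator: takes a list of per-example feature dicts
# and returns one padded batch dict. In practice you would wrap the
# padded lists in torch.tensor(...) before handing them to the model.
def pad_collator(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        padding = [pad_token_id] * (max_len - len(ids))
        batch["input_ids"].append(ids + padding)
        batch["attention_mask"].append([1] * len(ids) + [0] * len(padding))
    return batch

# Example batch of unequal-length examples.
features = [{"input_ids": [5, 6, 7]}, {"input_ids": [8]}]
batch = pad_collator(features)
print(batch["input_ids"])
```

Such a callable is then passed as `Trainer(..., data_collator=pad_collator)`; for whole-word masking specifically, subclassing `DataCollatorForWholeWordMask` and overriding its masking logic is the more usual route.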
3 votes · 0 answers
How to disable seqeval label formatting for POS tagging
I am trying to evaluate my POS tagger using Hugging Face's implementation of the seqeval metric, but since my tags are not made for NER, they are not formatted the way the library expects. Consequently, when I try to read the results of my…

William A. · 425 · 5 · 14
3 votes · 1 answer
How to train a tokenizer on a big dataset?
Based on examples, I am trying to train a tokenizer and a T5 model for Persian.
I use Google Colab Pro. When I tried to run the following code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size =…

Ahmad · 8,811 · 11 · 76 · 141
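When the corpus doesn't fit in memory, the tokenizers library can train from an iterator that yields batches of text, so only one batch is materialized at a time. A small offline sketch (the tiny in-memory corpus and the vocab size stand in for the real Persian data; the question's `t5_tokenizer_model` is a helper from the HF examples, while the class used below ships with tokenizers itself):

```python
from tokenizers import SentencePieceUnigramTokenizer

# Tiny stand-in corpus; in the real case this would be a loaded
# datasets.Dataset with a "text" column.
corpus = ["a large persian corpus", "streamed batch by batch"] * 50

def batch_iterator(batch_size=32):
    # Yield lists of strings so the whole corpus never sits in RAM at once.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(batch_iterator(), vocab_size=60)
print(tokenizer.get_vocab_size())
```

With a real dataset the iterator would yield `dataset[i : i + batch_size]["text"]`, which keeps Colab's RAM usage flat regardless of corpus size.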
3 votes · 3 answers
Hugging Face TypeError: '>' not supported between instances of 'NoneType' and 'int'
I am working on fine-tuning a pretrained model on a custom dataset (using Hugging Face). I copied all the code correctly from a YouTube video and everything is OK, but in this cell/code:
with training_args.strategy.scope():
…

Shahid Khan · 103 · 1 · 2 · 9
2 votes · 1 answer
json2token not found when using the Donut VisionEncoderDecoderModel from Hugging Face Transformers
I am trying to fine-tune a Donut (Document Understanding) Hugging Face Transformer model, but am getting hung up trying to create a DonutDataset object. I have the following code (running in Google Colab):
!pip install transformers datasets…

Max Power · 8,265 · 13 · 50 · 91
2 votes · 1 answer
How to install rarfile and load the arabic_billion_words dataset from the Hugging Face datasets library?
I'm encountering an error while trying to load a Hugging Face dataset that requires the rarfile library. I have already installed rarfile with pip install rarfile, but I'm still getting the same error.
Here are the details of my environment,…

Sanae · 21 · 2
2 votes · 1 answer
Can I convert an `IterableDataset` to a `Dataset`?
I want to load a large dataset, apply some transformations to some fields, sample a small section from the results, and store it as files so I can later just load from there.
Basically something like this:
ds = datasets.load_dataset("XYZ",…

Zach Moshe · 2,782 · 4 · 24 · 40
2 votes · 1 answer
Not able to use map() or select(range()) with the Hugging Face Datasets library; gives "dill_.dill has no attribute log"
I'm not able to do dataset.map() or dataset.select(range(10)) with the Hugging Face Datasets library in Colab. It says dill_.dill has no attribute log.
I have tried different dill versions, but no luck. I tried older versions of the dill lib but…

user21401461 · 41 · 3
2 votes · 0 answers
How to fine-tune GPT-2 text generation using the Hugging Face Trainer API?
I'm fairly new to machine learning and am trying to figure out the Hugging Face Trainer API and their Transformers library. My end use case is to fine-tune a model like GODEL (or anything better than DialoGPT, really, which I managed to get working…

Evan Armstrong · 21 · 3
2 votes · 0 answers
Is there a way to download only a part of a dataset from Hugging Face?
I'm trying to load the People's Speech dataset, but it's way too big. Is there a way to download only a part of it?
from datasets import load_dataset
train = load_dataset("MLCommons/peoples_speech",…

FOXASDF · 43 · 3
2 votes · 1 answer
Arrow-related error when pushing a dataset to the Hugging Face Hub
I have quite a problem with my dataset:
The (future) dataset is a pandas dataframe that I loaded from a pickle file; the pandas dataset behaves correctly. My code is:
dataset.from_pandas(df)
dataset.push_to_hub("username/my_dataset",…

Tsadoq · 224 · 3 · 17
2 votes · 0 answers
Your fast tokenizer does not have the necessary information to save the vocabulary for a slow tokenizer
I'm trying to fine-tune a T5 model for paraphrasing Farsi sentences. I'm using this model as my base. My dataset is a paired-sentence dataset in which each row is a pair of paraphrased sentences. I want to fine-tune the model on this dataset. The…

Ali Ghasemi · 61 · 2