Questions tagged [huggingface-datasets]
Use this tag for questions related to the datasets project from Hugging Face. Project on GitHub: https://github.com/huggingface/datasets
221 questions
3 votes · 1 answer
How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?
I've looked extensively over the internet, Hugging Face's (HF's) discussion forum & repo, but found no end-to-end example of how to properly do DDP/distributed data parallel with HF (links at the end).
This is what I need to be capable of running it end…

Charlie Parker · 5,884 · 57 · 198 · 323
3 votes · 1 answer
How to load two pandas dataframes into Hugging Face's Dataset object?
I am trying to load the train and test data frames into a Dataset object. The usual way to load a pandas dataframe into a Dataset object is:
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
dataset =…

Aaditya Ura · 12,007 · 7 · 50 · 88
3 votes · 2 answers
How to train Wav2Vec2 XLSR with a local custom dataset
I want to train a speech-to-text model with Wav2Vec2 XLSR (a transformer-based model) for the Danish language. As a recommendation, many people train their model on Common Voice with the help of the datasets library, but in Common Voice there is very…

Siyam Fahad · 61 · 7
3 votes · 0 answers
Create a custom data_collator for the Hugging Face Trainer
I need to create a custom data_collator for fine-tuning with the Hugging Face Trainer API.
Hugging Face offers DataCollatorForWholeWordMask for masking whole words within sentences with a given probability.
model_ckpt =…

kkgarg · 1,246 · 1 · 12 · 28
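Since the Trainer accepts any callable as data_collator, one way to prototype is a plain function that batches and pads a list of feature dicts. This sketch uses Python lists for clarity (a real collator would wrap the padded lists in torch tensors), and the field names are illustrative:

```python
# A minimal custom collator: takes a list of per-example feature dicts
# and returns one padded batch dict. In practice you would wrap the
# padded lists in torch.tensor(...) before handing them to the model.
def pad_collator(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        padding = [pad_token_id] * (max_len - len(ids))
        batch["input_ids"].append(ids + padding)
        batch["attention_mask"].append([1] * len(ids) + [0] * len(padding))
    return batch

# Example batch of unequal-length examples.
features = [{"input_ids": [5, 6, 7]}, {"input_ids": [8]}]
batch = pad_collator(features)
print(batch["input_ids"])
```

Such a callable is then passed as `Trainer(..., data_collator=pad_collator)`; for whole-word masking specifically, subclassing `DataCollatorForWholeWordMask` and overriding its masking logic is the more usual route.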
3 votes · 0 answers
How to disable seqeval label formatting for POS tagging
I am trying to evaluate my POS tagger using Hugging Face's implementation of the seqeval metric, but since my tags are not made for NER, they are not formatted the way the library expects. Consequently, when I try to read the results of my…

William A. · 425 · 5 · 14
3 votes · 1 answer
How to train a tokenizer on a big dataset?
Based on examples, I am trying to train a tokenizer and a T5 model for Persian.
I use Google Colab Pro. When I tried to run the following code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size =…

Ahmad · 8,811 · 11 · 76 · 141
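When the corpus doesn't fit in memory, the tokenizers library can train from an iterator that yields batches of text, so only one batch is materialized at a time. A small offline sketch (the tiny in-memory corpus and the vocab size stand in for the real Persian data; the question's `t5_tokenizer_model` is a helper from the HF examples, while the class used below ships with tokenizers itself):

```python
from tokenizers import SentencePieceUnigramTokenizer

# Tiny stand-in corpus; in the real case this would be a loaded
# datasets.Dataset with a "text" column.
corpus = ["a large persian corpus", "streamed batch by batch"] * 50

def batch_iterator(batch_size=32):
    # Yield lists of strings so the whole corpus never sits in RAM at once.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(batch_iterator(), vocab_size=60)
print(tokenizer.get_vocab_size())
```

With a real dataset the iterator would yield `dataset[i : i + batch_size]["text"]`, which keeps Colab's RAM usage flat regardless of corpus size.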
3 votes · 3 answers
Hugging Face TypeError: '>' not supported between instances of 'NoneType' and 'int'
I am working on fine-tuning a pretrained model on a custom dataset (using Hugging Face). I copied all the code correctly from a YouTube video and everything is OK, but in this cell/code:
with training_args.strategy.scope():
…

Shahid Khan · 103 · 1 · 2 · 9
2 votes · 1 answer
json2token not found when using the Donut VisionEncoderDecoderModel from Hugging Face Transformers
I am trying to fine-tune a Donut (Document Understanding) Hugging Face Transformer model, but am getting hung up trying to create a DonutDataset object. I have the following code (running in Google Colab):
!pip install transformers datasets…

Max Power · 8,265 · 13 · 50 · 91
2 votes · 1 answer
How to install rarfile and load the arabic_billion_words dataset from the Hugging Face datasets library?
I'm encountering an error while trying to load a Hugging Face dataset that requires the rarfile library. I have already installed rarfile with pip install rarfile, but I'm still getting the same error.
Here are the details of my environment,…

Sanae · 21 · 2
2 votes · 1 answer
Can I convert an `IterableDataset` to a `Dataset`?
I want to load a large dataset, apply some transformations to some fields, sample a small section from the results, and store it as files so I can later just load from there.
Basically something like this:
ds = datasets.load_dataset("XYZ",…

Zach Moshe · 2,782 · 4 · 24 · 40
2 votes · 1 answer
Not able to use map() or select(range()) with the Hugging Face Datasets library; gives "dill_.dill has no attribute log"
I'm not able to do dataset.map() or dataset.select(range(10)) with the Hugging Face Datasets library in Colab. It says dill_.dill has no attribute log.
I have tried different dill versions, but no luck. I tried older versions of the dill lib but…

user21401461 · 41 · 3
2 votes · 0 answers
How to fine-tune GPT-2 text generation using the Hugging Face Trainer API?
I'm fairly new to machine learning and am trying to figure out the Hugging Face Trainer API and their Transformers library. My end use case is to fine-tune a model like GODEL (or anything better than DialoGPT, really, which I managed to get working…

Evan Armstrong · 21 · 3
2 votes · 0 answers
Is there a way to download only a part of a dataset from Hugging Face?
I'm trying to load the People's Speech dataset, but it's way too big. Is there a way to download only a part of it?
from datasets import load_dataset
train = load_dataset("MLCommons/peoples_speech",…

FOXASDF · 43 · 3
2 votes · 1 answer
Arrow-related error when pushing a dataset to the Hugging Face Hub
I have quite a problem with my dataset:
The (future) dataset is a pandas dataframe that I loaded from a pickle file; the pandas dataset behaves correctly. My code is:
dataset.from_pandas(df)
dataset.push_to_hub("username/my_dataset",…

Tsadoq · 224 · 3 · 17
2 votes · 0 answers
Your fast tokenizer does not have the necessary information to save the vocabulary for a slow tokenizer
I'm trying to fine-tune a T5 model for paraphrasing Farsi sentences. I'm using this model as my base. My dataset is a paired-sentence dataset in which each row is a pair of paraphrased sentences. I want to fine-tune the model on this dataset. The…

Ali Ghasemi · 61 · 2