Hello everyone, I badly need your help. I am trying to fine-tune the gpt2-medium model with the Hugging Face Transformers library, and I hit "KeyError: 0" as soon as I started training. Here is my full code:

import pandas as pd 
import numpy as np


dataset = pd.read_csv('Train_rev1.csv',error_bad_lines=False, engine='python')
# dataset.head(5)

def replace_string(row):
    row['FullDescription'] = row['FullDescription'].replace('****', str(row['SalaryNormalized']))
    return row

dataset = dataset.apply(replace_string, axis=1)
dataset = dataset.drop(['ContractType','ContractTime','LocationRaw','SalaryRaw','SourceName','Id','Title', 'LocationNormalized', 'Company', 'Category',
       'SalaryNormalized'], axis=1)

dataset.columns

! pip install -q transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenized_data = tokenizer(dataset['FullDescription'].tolist(), truncation=True, padding=True)

# Split data into training and validation sets
train_size = int(0.8 * len(tokenized_data['input_ids']))
val_size = len(tokenized_data['input_ids']) - train_size

train_dataset = {'input_ids': tokenized_data['input_ids'][:train_size],
                 'attention_mask': tokenized_data['attention_mask'][:train_size]}
val_dataset = {'input_ids': tokenized_data['input_ids'][train_size:],
               'attention_mask': tokenized_data['attention_mask'][train_size:]}

I believe the error originates somewhere around this section:

from transformers import GPT2Config
# Define model configuration and instantiate model
model_config = GPT2Config.from_pretrained('gpt2-medium')
model_config.output_hidden_states = True
model = GPT2LMHeadModel.from_pretrained('gpt2-medium', config=model_config)

# Train model using Huggingface Trainer API
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=50,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

My IDE underlines this last statement and raises 'KeyError: 0'. It does not give me any other detail about the error apart from:

KeyError                                  Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

5 frames
/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     49                 data = self.dataset.__getitems__(possibly_batched_index)
     50             else:
---> 51                 data = [self.dataset[idx] for idx in possibly_batched_index]
     52         else:
     53             data = self.dataset[possibly_batched_index]

KeyError: 0
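
For what it's worth, the traceback seems to point at integer indexing: the data loader calls `self.dataset[idx]` with `idx = 0`, and a plain dict whose keys are 'input_ids' and 'attention_mask' has no key `0`. The sketch below tries to reproduce that without Transformers or PyTorch. The `fetch` helper and the token ids are made up for illustration; only the list comprehension is taken from the traceback.

```python
# A plain dict of lists, shaped like train_dataset in the question.
# The token ids are placeholder values, not real tokenizer output.
train_dataset = {
    "input_ids": [[10, 11, 12], [20, 21, 22]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}


def fetch(dataset, batch_indices):
    """Hypothetical helper mimicking line 51 of the traceback."""
    return [dataset[idx] for idx in batch_indices]


try:
    fetch(train_dataset, [0, 1])
    error = None
except KeyError as exc:
    # The dict is looked up with the integer 0, which is not one of its keys.
    error = exc

print(repr(error))
```

If this reading is right, the problem is the shape of `train_dataset` and `val_dataset` rather than anything in `TrainingArguments`.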

I have tried changing some of the TrainingArguments, but nothing works, and I am out of ideas because the error message is not explicit.
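
One possible way around this, sketched under assumptions: wrap each dict in a small map-style dataset that accepts an integer index and returns a single example. The class name `TokenizedTextDataset` is hypothetical and the token ids are placeholders. The `labels` field is an addition not present in the code above; it assumes the usual causal language modeling setup where the labels are the input ids, with padded positions set to -100 so the loss ignores them.

```python
class TokenizedTextDataset:
    """Map-style dataset: integer index in, one example dict out."""

    def __init__(self, encodings):
        self.input_ids = encodings["input_ids"]
        self.attention_mask = encodings["attention_mask"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        ids = self.input_ids[idx]
        mask = self.attention_mask[idx]
        # Assumption: causal LM fine-tuning, so labels mirror the inputs.
        # -100 marks padded positions so they are skipped by the loss.
        labels = [tok if m == 1 else -100 for tok, m in zip(ids, mask)]
        return {"input_ids": ids, "attention_mask": mask, "labels": labels}


# Placeholder values standing in for a slice of tokenized_data.
encodings = {
    "input_ids": [[10, 11, 12], [20, 21, 50257]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}
dataset = TokenizedTextDataset(encodings)
first = dataset[0]   # integer indexing now works
second = dataset[1]  # last position is padding, so its label is -100
```

With something like this, `train_dataset=TokenizedTextDataset(train_dataset)` and `eval_dataset=TokenizedTextDataset(val_dataset)` could be passed to `Trainer`, whose default collator should turn the lists into tensors. Separately, because '[PAD]' is added as a new token, the model would probably also need `model.resize_token_embeddings(len(tokenizer))` before training; otherwise the new token id falls outside the embedding matrix.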

nkdtech
  • What is in `Train_rev1.csv`, could you provide a sample of it? – alvas Apr 11 '23 at 03:41
  • @alvas Train_rev1.csv is a dataset of job descriptions. It contains all the information about each job: title, company name, salary, location, job description, job type, etc. – nkdtech Apr 11 '23 at 21:01
