
This article describes a deep-learning model based on Transfer Learning and LSTM, in which the author used 10-fold cross-validation (as explained in Table 3) and took the average of the results. I am familiar with 10-fold cross-validation, where the data has to be divided and passed to the model, but in this code (here) I can't figure out how to partition the data and pass it.

There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; both are used for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of txt files, and after running the model, it produces two new txt files: one with the predicted labels and one with the true labels.

There is this code in the main file:

model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')

if args.mode=='train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode=='test':
    sess = model.restore_last_session()
    model.predict(data, sess)

in which `data` is an instance of the `Data` class (code) that contains the test/train/dev datasets. I think this is where I need to pass the partitioned data. If I am right, how can I do the partitioning and perform 10-fold cross-validation?

data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)

class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size

        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        #self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained
  • Basically K-fold means you need to run the training n times (usually 10), where each time the test data is a different p% (usually 10%) of the whole population. Because the data is integrated with the model (args to the `constructor`), your only option is to override/copy its `train()`. If you can post it here and also share what you have done so far, that could be a big help – shahaf Oct 13 '19 at 16:30
  • @shahaf the train is here [link](https://github.com/jefferyYu/EMNLP18_codes/blob/master/ec_lstm_kl_att_topic_other%2Bsentiment_dasts2/ec_bilstm_kl_topic_self_att_dasts.py), in the middle of the page. If we only need to change the test data, can I change the `self.test=data['test']` in class Data instead of changing `train()`? Thanks – Zahra Hnn Oct 14 '19 at 04:23

1 Answer


By the look of it, the train method can accept a session and continue training from an existing model: `def train(self, data, sess=None)`

so with very minimal changes to the existing code and libraries you can do something like this:

First, load all the data and build the model:

data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)

model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')

Then create the cross-validation datasets, something like:

def get_new_data_object():
  return Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
              './data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)

cross_validation = []
for i in range(10):
  tmp_data = get_new_data_object()
  tmp_data.train = #get 90% of tmp_data.train
  tmp_data.valid = #get 90% of tmp_data.valid
  tmp_data.test = #get 90% of tmp_data.test
  tmp_data.train2 = #get 90% of tmp_data.train2
  tmp_data.valid2 = #get 90% of tmp_data.valid2
  tmp_data.test2 = #get 90% of tmp_data.test2
  cross_validation.append(tmp_data)
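
The `#get 90% ...` placeholders depend on how each split is actually stored in the binary file. As a rough sketch, assuming each split (e.g. `tmp_data.train`) is a plain Python list of examples (an assumption; the repo's `.bin` files may hold something else), the i-th fold could drop a different 10% chunk each time:

def fold_slice(examples, fold, n_folds=10):
    # hypothetical helper: keep everything except the fold-th 10% chunk,
    # which is held out for this fold
    fold_size = len(examples) // n_folds
    start = fold * fold_size
    end = start + fold_size
    return examples[:start] + examples[end:]  # for lists; use np.concatenate for numpy arrays

Each `#get 90% ...` line above would then become, e.g., `tmp_data.train = fold_slice(tmp_data.train, i)`.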

Then run the model n times (10 for 10-fold cross-validation):

sess = None
for data in cross_validation:
  model.train(data, sess)
  sess = model.restore_last_session()
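
If you also want the per-fold prediction files (the predicted/true label txt files mentioned in the question), the loop can reuse the `predict` call from the original main script after each restore; a minimal sketch:

sess = None
for fold_data in cross_validation:
    model.train(fold_data, sess)
    sess = model.restore_last_session()
    model.predict(fold_data, sess)  # writes the predicted/true label files for this fold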

Keep in mind a few key points:

  • I don't know how your data is structured exactly, but that affects the way you split it into test, train, and (in your case) valid sets
  • the split has to be exactly the same for each triple of test, train, and valid; it can be done randomly or by taking a different part every time, as long as it is consistent
  • you can train the model n times with cross-validation, or create n models and pick the best to avoid overfitting (see the sketch below)
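
A sketch of the second option, one fresh model per fold; the `BiLstm(args, data, ckpt_path=...)` call is taken from the original main script, while giving each fold its own checkpoint directory is my assumption:

fold_models = []
for i, fold_data in enumerate(cross_validation):
    # assumption: a separate checkpoint directory per fold so the models don't overwrite each other
    fold_model = BiLstm(args, fold_data, ckpt_path='./' + args.data_name + '_output_fold%d/' % i)
    fold_model.train(fold_data)
    fold_models.append(fold_model)
# evaluate each fold_model on its held-out split and keep the best one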

This code is just a draft; you can implement it however you like. There are some great libraries that already implement this functionality, and of course it can be optimized (e.g. by not reading the whole data files each time).
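
For example, scikit-learn's `KFold` can generate the per-fold index splits for you; a minimal sketch, assuming the examples of one split can be indexed as a numpy array (again an assumption about how the `.bin` files are structured):

import numpy as np
from sklearn.model_selection import KFold

examples = np.array(data.train)  # assumption: data.train is a sequence of examples
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(examples)):
    fold_train = examples[train_idx]  # 90% of the examples for this fold
    fold_test = examples[test_idx]    # the held-out 10%
    # plug fold_train / fold_test into a fresh Data object and train on it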

One more consideration is to separate the model creation from the data, especially the data arg of the model constructor; from a quick look it seems the model only uses the dimensions of the data, so it is good practice not to pass the whole object.

Moreover, if the model integrates other properties of the data object into its state (when it is created), like the data itself, my code might not work and a more surgical approach would be needed.

Hope this helps and points you in the right direction.

  • The author replied that they didn't do 10-fold on this dataset (I guess because of the complexity of the data structure); instead they ran the model 10 times with different seeds. It was a great help. Thank you – Zahra Hnn Oct 20 '19 at 02:48