I have a list of DataFrames that I want to split into train and test sets. For a single DataFrame, I can do the following.

Compute the split point so that the last 125 rows form the test set:

split_point = len(df) - 125

and then,

train, test = df[0:split_point], df[split_point:]

This gives me the train and test split.

Now, for a list of DataFrames, I can get the split point of each DataFrame using,

split_point = [len(df)-125 for df in dfs]  ## THIS WORKS FINE

I want to get the train and test split for the whole list of DataFrames, as I did for the single DataFrame. I tried the following,

train, test = [(df[0:split_point], df[split_point:]) for df in dfs]

## AND THE FOLLOWING

train, test = [(df[0:split_point] for df in dfs),(df[split_point:]) for df in dfs]

Neither works. How can I do this?

(The DataFrames may differ in length, but that does not matter: 125 is subtracted from each length, so the last 125 rows of every DataFrame form its test set.)

i.n.n.m
    Your `split_point` is a list in your second case, why not just `list_of_trains, list_of_tests = zip(*[(df[0:len(df)-125], df[len(df)-125:]) for df in dfs])`? – Psidom Jul 19 '17 at 21:13

1 Answer

You need to do

train, test = zip(*[(dfs[i][0:split_point[i]], dfs[i][split_point[i]:]) for i in range(len(dfs))])

Then train and test are each a tuple containing the corresponding part of every DataFrame, in the same order as dfs.

In the above code I am using

split_point = [len(df)-125 for df in dfs]
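A variant of the same idea, as a minimal sketch: iterating over the frames directly avoids positional indexing, and `iloc` makes the row-position slicing explicit. The helper name `split_frames`, the `test_size` parameter, and the toy data are assumptions for illustration and do not come from the question.

```python
import pandas as pd


def split_frames(frames, test_size=125):
    """Split each DataFrame so that its last `test_size` rows form the test set.

    Hypothetical helper. Assumes `frames` is non-empty and that every
    frame has more than `test_size` rows.
    """
    pairs = []
    for df in frames:
        split_point = len(df) - test_size  # per-frame split index
        pairs.append((df.iloc[:split_point], df.iloc[split_point:]))
    # zip(*pairs) regroups [(train, test), ...] into (trains...), (tests...)
    train, test = zip(*pairs)
    return list(train), list(test)


# Toy frames of different lengths (made-up data)
dfs = [pd.DataFrame({"x": range(n)}) for n in (200, 150, 130)]
train, test = split_frames(dfs)

print([len(t) for t in train])
print([len(t) for t in test])
```

Each test frame has 125 rows regardless of the original length, and the train frames keep whatever rows come before that.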

To make this clearer, consider the following simpler example:

r = [(i,i**2) for i in range(5)]
a,b=zip(*r)

Then a is (0, 1, 2, 3, 4) and b is (0, 1, 4, 9, 16).
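To see why unpacking the list comprehension directly does not work, here is a small sketch that uses plain lists in place of DataFrames (the data and variable names are made up for illustration). An assignment of the form `a, b = some_list` expects exactly two items, whereas the comprehension produces one pair per input.

```python
# Plain lists stand in for DataFrames; the values are made up.
seqs = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]
test_size = 2

# One (head, tail) pair per input sequence
pairs = [(s[:len(s) - test_size], s[len(s) - test_size:]) for s in seqs]

# Direct unpacking fails: three pairs cannot be assigned to two names.
try:
    heads, tails = pairs
    unpack_error = None
except ValueError as exc:
    unpack_error = exc

# zip(*pairs) transposes the list of pairs into two tuples.
heads, tails = zip(*pairs)

print(type(unpack_error).__name__)
print(heads)
print(tails)
```

Note that with exactly two inputs the direct unpacking would not raise at all; it would silently bind the first pair to one name and the second pair to the other, which is not the intended split.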

Miriam Farber