I am using a datatable Frame. How can I split it into train and test datasets?
As I would with a pandas DataFrame, I tried train_test_split(dt_df, classe)
from sklearn.model_selection, but it does not work and raises an error.
import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
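dt_df = dt.fread(csv_file_path)  # csv_file_path is defined elsewhere
classe = dt_df[:, "classe"]  # single-column Frame holding the labels
del dt_df[:, "classe"]  # drop the label column from the features
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)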
I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>
I tried a workaround that converts the frame to NumPy arrays:
classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()
This works, but I would like to know whether there is a way to make train_test_split
work directly on a datatable Frame, as it does with a pandas DataFrame.
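Another route I am considering is to split the row indices instead of the frame itself, and then select the rows from the Frame. The sketch below is only that: split_row_indices is a hypothetical helper of mine, not part of datatable or scikit-learn, and the commented usage assumes that a Frame accepts a list of integers as its row selector. It rounds the test set size up, which I believe matches what train_test_split does for a fractional test_size.

```python
import math
import random


def split_row_indices(n_rows, test_size=0.25, seed=None):
    """Return (train_idx, test_idx): shuffled, disjoint lists of row indices.

    Hypothetical helper, not a datatable or scikit-learn function.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    # Round the test set size up; the remaining rows form the training set.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# Assumed usage with the frames from the question (not executed here):
# train_idx, test_idx = split_row_indices(dt_df.nrows, test_size=test_size)
# X_train, X_test = dt_df[train_idx, :], dt_df[test_idx, :]
# y_train, y_test = classe[train_idx, :], classe[test_idx, :]
```

This would keep everything as datatable Frames, but it does not stratify by class the way train_test_split can with its stratify argument.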
Edit 1: The column names in the CSV file are strings, and the values are unsigned integers. Using print(dt_df)
we get:
     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---
   0 |   0    0    0    0    2    0    1    0    0    1  …
   1 |   0    0    0    0    1    0    2    1    0    1  …
   2 |   0    0    0    1    1    0    1    0    1    2  …
   3 |   0    0    0    1    1    0    1    0    1    2  …
   4 |   0    0    0    1    1    0    1    0    1    2  …
   5 |   0    0    0    1    1    0    1    0    1    2  …
   6 |   0    0    0    1    0    0    3    0    0    2  …
   7 |   0    0    0    1    1    0    0    0    1    2  …
   8 |   0    0    0    1    1    0    1    0    1    2  …
   9 |   0    0    1    0    1    0    1    0    1    3  …
  10 |   0    0    1    0    1    0    1    0    1    3  …
 ...
Thanks for your help.