How to split datatable dataframe into train and test dataset in python

Question

I am using datatable dataframe. How can I split the dataframe into train and test dataset?
As with a pandas DataFrame, I tried to use train_test_split(dt_df, classes) from sklearn.model_selection, but it fails with an error.

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split

dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>

As a workaround, I converted the dataframe to a numpy array:

classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()

This works, but I would like to know whether there is a way to make train_test_split work directly on a datatable dataframe, as it does with a pandas DataFrame.

Edit 1: The column names in the csv file are strings, and the values are unsigned integers. print(dt_df) gives:

     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …  
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---     
   0 |   0    0    0    0    2    0    1    0    0    1  …  
   1 |   0    0    0    0    1    0    2    1    0    1  …  
   2 |   0    0    0    1    1    0    1    0    1    2  …  
   3 |   0    0    0    1    1    0    1    0    1    2  …  
   4 |   0    0    0    1    1    0    1    0    1    2  …  
   5 |   0    0    0    1    1    0    1    0    1    2  …  
   6 |   0    0    0    1    0    0    3    0    0    2  …  
   7 |   0    0    0    1    1    0    0    0    1    2  …  
   8 |   0    0    0    1    1    0    1    0    1    2  …  
   9 |   0    0    1    0    1    0    1    0    1    3  …  
  10 |   0    0    1    0    1    0    1    0    1    3  …  
      ...

Thanks for your help.

Adding sample data would give us a clearer picture of which columns you want to use as your IV and DV. – The AG, Jul 21 '20 at 19:55
Thank you for your comment :) The column names in the csv file are strings, and the values are unsigned integers. – ibra, Jul 21 '20 at 20:08

score 7 · Answer 1 · answered Jan 04 '22 at 01:44

Here is a simple function I made using only pandas. The sample function randomly and uniformly selects rows (axis=0) in the dataframe for the test set. The rows for the training set can be selected by dropping the rows in the original dataframe with the same indexes as the test set.

def train_test_split(df, frac=0.2):
    
    # get random sample 
    test = df.sample(frac=frac, axis=0)

    # get everything but the test sample
    train = df.drop(index=test.index)

    return train, test
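
As a usage illustration, here is a minimal sketch of the same function applied to a small made-up pandas DataFrame. The random_state argument is an addition for reproducibility and is not part of the function above. The column names and values are invented for the example, and the approach assumes that the index of the DataFrame is unique (otherwise drop would remove more rows than were sampled).

```python
import pandas as pd


def train_test_split(df, frac=0.2, random_state=None):
    # Randomly sample a fraction of the rows for the test set.
    test = df.sample(frac=frac, axis=0, random_state=random_state)
    # Everything whose index label is not in the test set goes to training.
    train = df.drop(index=test.index)
    return train, test


# Made-up data: 10 rows, one feature column and one label column.
df = pd.DataFrame({"CCC": range(10), "classe": [0, 1] * 5})

train, test = train_test_split(df, frac=0.2, random_state=0)
print(len(train), len(test))
```

Note that this works on pandas DataFrames only. A datatable Frame would first need to be converted with to_pandas().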

score 1 · Answer 2 · answered Jul 21 '20 at 20:06

I don't know of a function that can split a datatable Frame directly, but you can load the file with pandas and use

import pandas as pd

dt_df = pd.read_csv(csv_file_path)
classe = dt_df["classe"]
dt_df = dt_df.drop(columns="classe")

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

and then convert the pandas DataFrame to a datatable Frame with:

X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)

Manoor Hassan


Thank you for your response. In fact, I must use datatable to load the csv file instead of pandas read_csv because I deal with huge files: pandas read_csv takes a long time (for example, 34 minutes), while datatable fread() takes only 40 seconds. – ibra Jul 21 '20 at 20:16
Would converting back and forth between the two solve your issue? That is, use datatable for loading, convert to a DataFrame to split, then convert back. – Manoor Hassan Jul 21 '20 at 20:20
Yes, for now I use a solution like that: as I said in my post, I convert the datatable to a numpy array and the split works fine. My goal now is to find (if it exists, of course) a way that works as it does with a pandas DataFrame, directly and without converting. Thank you very much for your help @Manoor Hassan – ibra Jul 21 '20 at 20:28

ibra · Accepted Answer · 2020-07-23T13:01:23.017

The solution I use to split a datatable dataframe into train and test datasets with train_test_split(dt_df, classes) from sklearn.model_selection is to convert the datatable dataframe either to a numpy array, as I mentioned in my question, or to a pandas DataFrame, as suggested by @Manoor Hassan (optionally converting back again after the split):

source code before split method:

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"]

source code after split method:

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

method 1: convert to numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method
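
To make the flow of method 1 concrete, here is a minimal end-to-end sketch. It uses small random numpy arrays as stand-ins for dt_df.to_numpy() and for the classe vector, so the data, the array sizes, test_size=0.25, n_estimators=10 and the random_state values are all assumptions chosen for the example, not values from my real run.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for dt_df.to_numpy(): 100 rows, 10 unsigned-int feature columns.
X = rng.integers(0, 4, size=(100, 10))
# Stand-in for np.ravel(dt_df[:, "classe"]): one class label per row.
y = rng.integers(0, 2, size=100)

# Split the plain numpy arrays, exactly as in method 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train and predict, as in the "source code after split method" part.
ExTrCl = ExtraTreesClassifier(n_estimators=10, random_state=0)
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

print(X_train.shape, X_test.shape, pred_test.shape)
```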

method 2: convert to numpy and return back to datatable dataframe after the split:

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

X_train = dt.Frame(X_train)

# source code after split method

method 3: convert to pandas dataframe

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

All 3 methods work, but they differ in the time taken for training (ExTrCl.fit) and prediction (ExTrCl.predict). For a csv file of about 500 MB, I got these results:

                       T.convert    T.train     T.pred
M1 to_numpy             3           85          0.5
M2 to_numpy and back    3.5         29          0.5
M3 to pandas            4           37          4
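
If the goal is to avoid converting the whole frame, one possibility is to split row positions rather than the data itself and then select rows with them. The sketch below is not one of the three methods timed above and has not been benchmarked. split_row_indices is a hypothetical helper built on numpy only, and the commented-out selection at the end assumes that a datatable Frame accepts a list of integers as its row selector.

```python
import numpy as np


def split_row_indices(n_rows, test_size=0.2, seed=None):
    """Return (train_idx, test_idx): disjoint, sorted arrays of row positions."""
    rng = np.random.default_rng(seed)
    # Shuffle all row positions, then cut the shuffled order in two.
    perm = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    test_idx = np.sort(perm[:n_test])
    train_idx = np.sort(perm[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_row_indices(10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))

# Hypothetical use with a datatable Frame (assumption: integer-list row selector):
# X_train = dt_df[train_idx.tolist(), :]
# X_test = dt_df[test_idx.tolist(), :]
# y_train, y_test = classe[train_idx], classe[test_idx]
```

Because only positions are shuffled, the same two index arrays can be applied to both the feature frame and the classe array, which keeps rows and labels aligned.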
