
I'm using PyCaret classification to do some machine learning on >1 million rows of data (18 categorical features and 1 numeric feature). A pandas DataFrame stores the data pulled from an Oracle database; those steps take about 2-3 minutes. Preprocessing the data, however, takes more than 7 hours. Is there a way to improve the speed?

Here's the Python code:

    from pycaret.classification import *

    # init setup
    clfl = setup(
        data=SQL_Query,
        target='cat_ind',
        silent=True,
        html=False,
        categorical_features=['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6',
                              'cat7', 'cat8', 'cat9', 'cat10', 'cat11', 'cat12',
                              'cat13', 'cat14', 'cat15', 'cat16', 'cat17'],
        numeric_features=['amt'],
        ignore_features=['paid', 'catignore'],
        remove_outliers=True,
        train_size=0.9,
        handle_unknown_categorical=True,
        unknown_categorical_method='most_frequent',
    )
rlee300
  • I would try dumping the data to a flat file (CSV), then trying MLJAR AutoML (https://github.com/mljar/mljar-supervised). It can handle missing values and categorical columns, and it will produce reports for trained ML models. – pplonski Apr 08 '21 at 12:08
  • Not all algorithms support GPU; most don't. You need to read the documentation first. – Frank Aug 02 '21 at 13:14

2 Answers


In pycaret, you can pass use_gpu=True to setup() and turbo=True to compare_models().
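
As a minimal sketch (assuming pycaret 2.x, with SQL_Query and the column names taken from the question), that would look something like this:

    from pycaret.classification import *

    # use_gpu=True lets GPU-enabled estimators (e.g. XGBoost, LightGBM,
    # CatBoost) train on the GPU where supported
    clf = setup(
        data=SQL_Query,
        target='cat_ind',
        silent=True,
        html=False,
        use_gpu=True,
    )

    # turbo=True (the default) skips the most computationally expensive
    # estimators during model comparison, which shortens the run
    best = compare_models(turbo=True)

Note that use_gpu mainly accelerates model training for estimators that support it, so it may not help much with the preprocessing done inside setup() itself.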

Gopakumar G

What shape is the data after setup()?

With that many categorical features, there's a chance your feature count grew by orders of magnitude due to the one-hot encoding that pycaret's setup() applies by default.

If that is the case, you should use high_cardinality_features to list the features with a high number of unique values, and set high_cardinality_method to either 'frequency' or 'clustering'.
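
A rough sketch of what that could look like (assuming pycaret 2.x; 'cat3' and 'cat7' are hypothetical stand-ins for whichever columns actually have many unique values):

    from pycaret.classification import *

    clf = setup(
        data=SQL_Query,
        target='cat_ind',
        silent=True,
        html=False,
        # hypothetical: list the columns with many distinct levels here
        high_cardinality_features=['cat3', 'cat7'],
        # 'frequency' replaces each level with its frequency count instead
        # of one-hot encoding it; 'clustering' groups similar levels first
        high_cardinality_method='frequency',
    )

    # inspect the transformed training data to see how many columns
    # the encoding actually produced
    print(get_config('X').shape)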

Check the pycaret documentation for more info.

MarcinKamil