How to change a numeric column to categorical data in Driverless AI

Question

I am trying out Driverless AI using the Docker version. When I import my data, Driverless AI does not correctly recognize which columns are truly numeric and which are categorical variables.

How can I fix this?

score 1 · Answer 1 · answered Apr 29 '19 at 19:34

The handling of categorical variables, and the user controls for it, are described in the DAI documentation FAQ. I will repost it here for your convenience:

How does Driverless AI deal with categorical variables? What if an integer column should really be treated as categorical?

If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple methods for how Driverless AI converts the categorical variables to numeric. These include:

One Hot Encoding: creating dummy variables for each value
Frequency Encoding: replace category with how frequently it is seen in the data
Target Encoding: replace category with the average target value (additional steps included to prevent overfitting)
Weight of Evidence: calculate weight of evidence for each category (http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/)

Driverless AI will try multiple methods for representing the column and determine which representation(s) are best.
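To make two of the encodings above concrete, here is a minimal sketch of frequency encoding and target encoding in plain Python. This is only an illustration of the general ideas, not Driverless AI's implementation: the function names and sample data are made up, and the target encoding here is the naive per-category mean without the extra overfitting safeguards the FAQ mentions.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


def target_encode(values, targets):
    """Replace each category with the mean target value for that category.

    Naive version: no cross-validation folds or smoothing, so it can overfit.
    """
    sums = defaultdict(float)
    counts = Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]


# Hypothetical sample column and binary target
colors = ["red", "blue", "red", "green"]
target = [1, 0, 0, 1]

freq = frequency_encode(colors)
tenc = target_encode(colors, target)
print(freq)
print(tenc)
```

In practice a target encoder would compute the category means on out-of-fold data so that a row's own target does not leak into its encoded value.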

If the column has integers, Driverless AI will try treating the column as both a categorical column and a numeric column. It does this for any integer column whose number of unique values is less than 50.
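The rule of thumb described here can be sketched as a simple unique-count check. This is my own illustration, not DAI code: the function name is hypothetical, and I am assuming a strict "less than" comparison against the threshold, as the sentence above states.

```python
def is_categorical_candidate(column, max_uniques=50):
    """Return True if an all-integer column has fewer than max_uniques distinct values."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in column):
        return False
    return len(set(column)) < max_uniques


# Hypothetical columns: a small set of region codes vs. a unique row id
region_code = [1, 2, 3, 1, 2, 3, 1]
row_id = list(range(1000))

print(is_categorical_candidate(region_code))
print(is_categorical_candidate(row_id))
```

A column of numeric codes with only a few distinct values passes the check, while an identifier with 1000 distinct values does not.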

This is configurable in the config.toml file:

# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represent different numerical codes instead.
num_as_cat = true

# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50
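Since string columns are treated as categorical, another option is to turn the numeric codes into strings before importing the file. Below is a rough sketch using only the Python standard library. The helper name, the "cat_" prefix and the sample data are my own choices, and I am assuming that a non-numeric prefix is enough for the importer to read the column as strings.

```python
import csv
import io


def prefix_column(csv_text, column, prefix="cat_"):
    """Return CSV text where non-empty values in `column` get a string prefix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Leave empty cells untouched so missing values stay missing
        if row[column] != "":
            row[column] = prefix + row[column]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical data: store_id is a numeric code, sales is a real number
raw = "store_id,sales\n101,12.5\n102,7.0\n,3.2\n"
converted = prefix_column(raw, "store_id")
print(converted)
```

With a real file you would read the text from disk, convert it, and write the result to a new CSV that you then import.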

(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
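For anyone unfamiliar with Benford's Law: in many naturally occurring numeric datasets the leading digit d appears with probability log10(1 + 1/d), so 1 is the most common first digit and 9 the least. The sketch below compares a column's leading-digit frequencies with that expectation. How Driverless AI actually performs its test is not stated in the FAQ, so the deviation measure and the sample data here are purely my own assumptions.

```python
import math
from collections import Counter

# Expected leading-digit probabilities under Benford's Law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def benford_deviation(values):
    """Mean absolute gap between observed and Benford leading-digit frequencies."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)) / 9


# Hypothetical columns: exponential growth vs. codes that all start with 5
growth = [2 ** k for k in range(1, 200)]
codes = [500 + i % 40 for i in range(200)]

print(benford_deviation(growth))
print(benford_deviation(codes))
```

The growth series stays close to the Benford distribution, while the code-like column deviates strongly, which is the kind of signal that suggests a column holds labels rather than quantities.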
