I am trying to use Driverless AI with the Docker version. When I import my data, I have a problem with how it recognizes which columns are truly numeric and which are categorical variables.
How can I fix this?
The handling of categorical variables, and how the user can control it, is described in the DAI documentation FAQ. I will repost it here for your convenience:
How does Driverless AI deal with categorical variables? What if an integer column should really be treated as categorical?
If a column has string values, then Driverless AI will treat it as a categorical feature. There are multiple methods for how Driverless AI converts the categorical variables to numeric. These include:
If the column has integers, Driverless AI will try treating the column as both a categorical column and a numeric column. It does this for any integer column whose number of unique values is less than 50.
This is configurable in the config.toml file:
# Whether to treat some numerical features as categorical
# For instance, sometimes an integer column may not represent a numerical feature but
# represent different numerical codes instead.
num_as_cat = true
# Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques = 50
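To see which of your own columns would fall under this rule before importing, the check described above can be approximated in a few lines of Python. This is only a sketch of the rule as the FAQ states it, not Driverless AI's actual code: the function names are made up for illustration, and `sample_rows` is a placeholder standing in for `statistical_threshold_data_size_small`, whose real value is not given here.

```python
def is_integer_valued(values):
    """Return True if every value has no fractional part (e.g. 3 or 3.0)."""
    return all(float(v).is_integer() for v in values)


def candidate_for_categorical(values, num_as_cat=True,
                              max_int_as_cat_uniques=50,
                              sample_rows=100_000):
    """Hypothetical mirror of the FAQ rule, not the product's implementation.

    A column qualifies when num_as_cat is enabled, its sampled values are
    all integers, and it has fewer than max_int_as_cat_uniques distinct values.
    sample_rows is an arbitrary placeholder for the "first N rows" limit.
    """
    if not num_as_cat:
        return False
    sample = values[:sample_rows]
    if not sample or not is_integer_valued(sample):
        return False
    return len(set(sample)) < max_int_as_cat_uniques


# Three distinct codes repeated: few unique values, so it qualifies.
store_codes = [101, 102, 103] * 10
# One thousand distinct integers: too many unique values to qualify.
order_totals = list(range(1000))

print(candidate_for_categorical(store_codes))   # True
print(candidate_for_categorical(order_totals))  # False
```

If a column you consider categorical comes out as False here, raising `max_int_as_cat_uniques` in config.toml is the setting the FAQ points to.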
(Note: Driverless AI will also check if the distribution of any numeric column differs significantly from the distribution of typical numerical data using Benford’s Law. If the column distribution does not obey Benford’s Law, we will also try to treat it as categorical even if there are more than 50 unique values.)
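For intuition about the Benford's Law check, here is a rough sketch that compares the leading digits of a column against the Benford distribution, where digit d is expected with probability log10(1 + 1/d). The FAQ does not say which statistical test Driverless AI uses, so the chi-square statistic and the 15.507 cutoff (8 degrees of freedom, 5% level) are my own assumptions, and the function names are illustrative only.

```python
import math
from collections import Counter


def leading_digit(x):
    """Return the first significant digit of x, or None if x is zero."""
    x = abs(float(x))
    if x == 0:
        return None
    # Scientific notation such as '4.560000000000000e-03' starts with the digit.
    return int(f"{x:.15e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits vs. Benford's Law."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


def deviates_from_benford(values, critical=15.507):
    """Assumed decision rule: flag the column if the statistic exceeds the cutoff."""
    return benford_chi_square(values) > critical


# ID-like values: every leading digit is equally common, unlike Benford.
id_like = list(range(100, 1000))
# Powers of two are a classic example of data that follows Benford's Law.
growth_like = [2 ** k for k in range(1, 301)]

print(deviates_from_benford(id_like))      # True
print(deviates_from_benford(growth_like))  # False
```

Columns such as IDs or codes tend to be flagged by a check like this, which matches the note's point that such a column may be tried as categorical even with more than 50 unique values.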