Questions tagged [categorical-data]

Statistical data type whose value is one of a fixed number of nominal categories.

For analysis, categorical values are treated as abstract entities without any mathematical structure such as an order or a topology, regardless of how they are coded and stored. For more, see the "Categorical variable" article on Wikipedia. There are two main types of categorical data: nominal and ordinal.
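
As a rough illustration of the nominal/ordinal distinction mentioned above, here is a minimal sketch in plain Python. The example labels and the rank order of the sizes are assumptions made up for illustration; they do not come from any question on this page.

```python
# Nominal data: labels with no inherent order. Only equality is meaningful.
colors = ["red", "green", "blue", "green"]

# Integer codes for nominal data are arbitrary: any consistent relabeling works.
# Here the codes simply follow alphabetical order of the distinct labels.
color_codes = {label: code for code, label in enumerate(sorted(set(colors)))}
encoded_colors = [color_codes[c] for c in colors]

# Ordinal data: labels with an explicit rank order (assumed here for illustration).
size_order = ["small", "medium", "large"]
size_rank = {label: rank for rank, label in enumerate(size_order)}

sizes = ["large", "small", "medium", "small"]
# Sorting by rank respects the declared order rather than alphabetical order.
sorted_sizes = sorted(sizes, key=size_rank.__getitem__)
```

The codes assigned to the colors carry no meaning beyond identity, while the ranks assigned to the sizes encode the order itself.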

1770 questions
160
votes
6 answers

How to force R to use a specified factor level as reference in a regression?

How can I tell R to use a certain level as reference if I use binary explanatory variables in a regression? It's just using some level by default. lm(x ~ y + as.factor(b)) with b {0, 1, 2, 3, 4}. Let's say I want to use 3 instead of the zero that…
Matt Bannert
  • 27,631
  • 38
  • 141
  • 207
127
votes
6 answers

Pandas: convert categories to numbers

Suppose I have a dataframe with countries that goes as:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices…
sachinruk
  • 9,571
  • 12
  • 55
  • 86
94
votes
4 answers

pandas dataframe convert column type to string or categorical

How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks! df =…
jklaus
  • 1,194
  • 1
  • 10
  • 16
77
votes
3 answers

Plotting with ggplot2: "Error: Discrete value supplied to continuous scale" on categorical y-axis

The plotting code below gives "Error: Discrete value supplied to continuous scale". What's wrong with this code? It works fine until I try to change the scale, so the error is there... I tried to figure out solutions from similar problems but…
Rechlay
  • 1,457
  • 2
  • 12
  • 19
65
votes
5 answers

Scikit-learn's LabelBinarizer vs. OneHotEncoder

What is the difference between the two? It seems that both create new columns, as many as there are unique categories in the feature. Then they assign 0 or 1 to each data point depending on which category it belongs to.
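
The encoding scheme described in this excerpt (one new column per unique category, filled with 0s and 1s) can be sketched in plain Python. This is only an illustration of the general idea, not how scikit-learn's LabelBinarizer or OneHotEncoder is implemented; the function name `one_hot` and the sample values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    # One column per unique category, in sorted order so the layout is deterministic.
    categories = sorted(set(values))
    column_of = {cat: i for i, cat in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # mark the column matching this value's category
        rows.append(row)
    return categories, rows


# Hypothetical sample data for illustration.
categories, rows = one_hot(["US", "CA", "US", "AU"])
```

Each row sums to 1, and the number of columns equals the number of distinct categories.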
62
votes
4 answers

XGBoost Categorical Variables: Dummification vs encoding

When using XGBoost we need to convert categorical variables into numeric. Would there be any difference in performance/evaluation metrics between the methods of: (1) dummifying your categorical variables; (2) encoding your categorical variables from e.g.…
ishido
  • 4,065
  • 9
  • 32
  • 42
60
votes
10 answers

Any way to get mappings of a label encoder in Python pandas?

I am converting strings to categorical values in my dataset using the following piece of code. data['weekday'] = pd.Categorical.from_array(data.weekday).labels For eg, index weekday 0 Sunday 1 Sunday 2 Wednesday 3 …
Gingerbread
  • 1,938
  • 8
  • 22
  • 36
59
votes
6 answers

Make Frequency Histogram for Factor Variables

I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue, but couldn't find a solution. Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that…
OnlyDean
  • 1,025
  • 1
  • 13
  • 25
50
votes
1 answer

Is it possible to read categorical columns with pandas' read_csv?

I have tried passing the dtype parameter with read_csv as dtype={n: pandas.Categorical} but this does not work properly (the result is an Object). The manual is unclear.
Emre
  • 5,976
  • 7
  • 29
  • 42
49
votes
5 answers

How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the…
38
votes
5 answers

How to sort pandas dataframe by custom order on string index

I have the following data frame: import pandas as pd # Create DataFrame df = pd.DataFrame( {'id':[2967, 5335, 13950, 6141, 6169],\ 'Player': ['Cedric Hunter', 'Maurice Baker' ,\ 'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],\ …
littleworth
  • 4,781
  • 6
  • 42
  • 76
38
votes
4 answers

Create dummies from column with multiple values in pandas

I am looking for a pythonic way to handle the following problem. The pandas.get_dummies() method is great to create dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2…
mkln
  • 14,213
  • 4
  • 18
  • 22
36
votes
2 answers

How (and why) do you use contrasts?

Under what cases do you create contrasts in your analysis? How is it done and what is it used for? I checked ?contrasts and ?C - both lead to "Chapter 2 of Statistical Models in S", which is not readily available to me.
Tal Galili
  • 24,605
  • 44
  • 129
  • 187
35
votes
5 answers

Add extra level to factors in dataframe

I have a data frame with numeric and ordered factor columns. I have a lot of NA values, so no level is assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level, so here is how I started, but I don't know…
enedene
  • 3,525
  • 6
  • 34
  • 41
34
votes
7 answers

Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of 10 features in my dataset. I use the preprocessing module from sklearn to do so, as follows: from sklearn import preprocessing cat_features = ['color', 'director_name', 'actor_2_name'] enc =…
Medo
  • 952
  • 3
  • 11
  • 22