Source: https://www.kaggle.com/code/alexisbcook/categorical-variables

In order to drop categorical variables, we use the command

drop_X_train = X_train.select_dtypes(exclude=['object'])

Doesn't it make more sense to use

drop_X_train = X_train.select_dtypes(exclude=['string'])

since categorical variables have data type string?

aneeq

1 Answer

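By default, pandas deliberately stores text as native Python strings, which requires the object dtype. A column of text therefore reports its dtype as object, not string, unless it has been explicitly converted. See "pandas distinction between str and object types".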

Also see: https://pandas.pydata.org/docs/user_guide/text.html

import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")
df["C"] = df["A"].astype("string")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A       4 non-null      object  
 1   B       4 non-null      category
 2   C       4 non-null      string  
dtypes: category(1), object(1), string(1)
memory usage: 328.0+ bytes

print(df)

   A  B  C
0  a  a  a
1  b  b  b
2  c  c  c
3  a  a  a
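
To tie this back to the question, here is a minimal sketch comparing the two select_dtypes calls. The toy X_train frame and its column names (price, color, color_str) are made up for illustration and are not taken from the Kaggle notebook; the color column is forced to the object dtype to mimic how text columns typically arrive in that tutorial.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative assumptions.
X_train = pd.DataFrame({
    "price": [1.0, 2.5, 3.0],
    # Text stored as plain Python strings, i.e. the object dtype.
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
})
# The same text explicitly converted to the pandas "string" extension dtype.
X_train["color_str"] = X_train["color"].astype("string")

# Excluding 'object' removes the plain-text column.
drop_object = X_train.select_dtypes(exclude=["object"])

# Excluding 'string' only removes columns that were explicitly
# converted to the "string" dtype; the object column survives.
drop_string = X_train.select_dtypes(exclude=["string"])

print(list(drop_object.columns))
print(list(drop_string.columns))
```

In this sketch, exclude=['string'] leaves the color column in place because its dtype is object, which is why the tutorial excludes 'object' instead.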
jch