I am working with a pandas DataFrame loaded from the UCI repository credit screening file at http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data
The data contains some missing values, and I want to apply a different imputation strategy depending on the data type of each column. For example, if the column is numeric, impute the median; if it is categorical, replace missing values with a category such as "No Value".
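To make the goal concrete, here is a minimal sketch of the imputation I have in mind. It assumes a small made-up frame instead of the real file, and the names `toy`, `impute_by_dtype` and `fill_label` are only illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit screening data (values are made up).
toy = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0, 10.0],
    "grade": ["a", None, "b", "a"],
})

def impute_by_dtype(df, fill_label="No Value"):
    """Fill numeric columns with their median and all other columns with a label."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    other_cols = out.columns.difference(numeric_cols)
    # median() is computed per column and skips NaN by default.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Every non-numeric column gets the placeholder category.
    out[other_cols] = out[other_cols].fillna(fill_label)
    return out

imputed = impute_by_dtype(toy)
```

This only works if the numeric columns are actually detected as numeric, which is where I am stuck.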
I ran this code to identify the numeric columns:
#Import data
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)
#Imputation
import numpy as np
data = data.replace('?', np.nan)
numeric_columns = data.select_dtypes(include=[np.number]).columns
And it returns:
Out[67]: Int64Index([2, 7, 10, 14], dtype='int64')
For some reason it does not identify column 1 (which is clearly numeric) as numeric. I believe the cause is that the NaN values in the column make it look non-numeric. Does anyone know what is going on, and what I can do to have column 1 recognized as numeric?
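One check that may help narrow this down is below. It is a sketch on a few made-up rows that imitate the file's layout, not the real data, and the `raw` string and all variable names are my own. It compares the detected numeric columns before and after an explicit conversion with `pd.to_numeric`, and also tries passing `na_values='?'` to `read_csv`:

```python
import io

import numpy as np
import pandas as pd

# Made-up rows imitating the file's layout; '?' marks a missing value.
raw = "b,30.83,0.5\na,?,4.46\na,24.50,1.0\n"
toy = pd.read_csv(io.StringIO(raw), header=None)

# Column 1 contains '?', so read_csv parses the whole column as strings.
# Replacing '?' with NaN afterwards leaves the column non-numeric.
replaced = toy.replace("?", np.nan)
before = replaced.select_dtypes(include=[np.number]).columns.tolist()

# Option A: convert explicitly; errors="coerce" turns unparsable entries into NaN.
converted = replaced.copy()
converted[1] = pd.to_numeric(converted[1], errors="coerce")
after = converted.select_dtypes(include=[np.number]).columns.tolist()

# Option B: declare the missing-value marker when reading the file.
at_read = pd.read_csv(io.StringIO(raw), header=None, na_values="?")
at_read_numeric = at_read.select_dtypes(include=[np.number]).columns.tolist()

print(before, after, at_read_numeric)
```

On these toy rows, column 1 only shows up as numeric after the explicit conversion or when `'?'` is declared at read time. That suggests the string dtype assigned during parsing matters more than the NaN values themselves, since a float column can hold NaN without any problem.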
Thanks!