
I have a CSV file with different types of data. For example, some columns are categorical (e.g. the name of a city) and some are numerical (e.g. the price of a product).

I would like to read the data file using Python 3 in such a way that all the categorical data are one-hot encoded and the numerical data are simply kept as scalar values.

Something like this:

import numpy as np

# d is the path to the CSV file; 'U20' is a Unicode string of up to 20 characters
x = np.loadtxt(d, delimiter=',',
               dtype={'names': ('city', 'price'),
                      'formats': ('U20', int)})

But here I want to one-hot encode the 'city' column as well.

Is there any data loader/preprocessor in numpy/pandas/scikit-learn that will read the CSV and also one-hot encode some of the columns?

Ahsan Tarique
  • I'm going to give this a shot, but just a heads up first: numpy probably isn't what you need. It would be good if you could include an example of your csv file and some information about its format. – AMC Oct 18 '19 at 00:38
  • Normally we load the CSV values as strings, and do the one-hot encoding after. It's easier to do the encoding when you have a whole array of strings to work with. File read is done line by line. – hpaulj Oct 18 '19 at 00:52
  • Load all your data first, and then do the one-hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html – PV8 Oct 18 '19 at 09:15

1 Answer


I think you should use the pandas package to do this:

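import pandas as pd

# Read the whole CSV into a DataFrame
df = pd.read_csv('file_name.csv')
# Make the column types explicit: city as string, price as integer
df['city'] = df['city'].astype('str')
df['price'] = df['price'].astype('int')
print(df)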
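
Note that the snippet above only loads the data and sets the column types; it does not one-hot encode anything yet. One possible way to do that step, following the "load first, encode after" advice from the comments, is `pd.get_dummies`. This is only a minimal sketch: the inline CSV text and the column names `city` and `price` are assumptions standing in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for the real file on disk.
csv_text = "city,price\nParis,10\nBerlin,20\nParis,30\n"

# Load everything first, as suggested in the comments.
df = pd.read_csv(io.StringIO(csv_text))

# One-hot encode only the categorical column; the numeric 'price'
# column is passed through unchanged as a scalar.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Plain numpy array, if that is what the downstream code expects.
X = encoded.to_numpy()

print(encoded)
```

With a real file you would replace `io.StringIO(csv_text)` with the file path. `get_dummies` creates one column per distinct city, so the set of columns depends on the values present in the data; if you need the same encoding for a later dataset, scikit-learn's `OneHotEncoder` (linked in the comments) may be a better fit.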