I'm trying to load a CSV file into a NumPy array for machine learning purposes. Until now I have always worked with int or float data, but my current CSV contains strings, floats, and ints, so I'm having trouble with the dtype argument. My dataset has 41188 samples and 8 features, e.g.:
47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"
I know that if I specify dtype=None, the type of each column will be determined from its contents:
data = np.genfromtxt(filename, dtype=None, delimiter=";", skip_header=1)
but apparently it doesn't work as I expect. First of all, the result of genfromtxt is a NumPy ndarray with the following shape:
In [2]: data.shape
Out[2]: (41188,)
while I expect (41188, 8).
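To reproduce this, here is a minimal sketch with two rows in the same format (the header line and the second data row are made up for illustration, since the real file isn't shown):

```python
import io
import numpy as np

# Two sample rows in the file's format; the header is a stand-in and is
# dropped by skip_header=1. The second row's values are invented.
sample = io.StringIO(
    "col0;col1;col2;col3;col4;col5;col6;col7;col8\n"
    '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'
    '35;"high.school";"no";100;2;93.200;-40.0;1.313;"yes"\n'
)

# dtype=None makes genfromtxt infer one dtype per column, so the result is
# a 1-D *structured* array: one record per row with named fields
# ('f0', 'f1', ...) instead of a 2-D table -- hence the (41188,) shape.
data = np.genfromtxt(sample, dtype=None, delimiter=";", skip_header=1,
                     encoding="utf-8")

print(data.shape)        # (2,) -- one dimension, not (2, 9)
print(data.dtype.names)  # ('f0', 'f1', ..., 'f8')
print(data["f0"])        # first column, inferred as integers
```

Note that genfromtxt has no notion of CSV quoting, so the double quotes stay as part of the string fields.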
If I use the default dtype instead:
data2 = np.genfromtxt(filename, delimiter=";", skip_header=1)
I obtain the following shape of data:
In [4]: data2.shape
Out[4]: (41188,8)
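The same two-row sketch as above (sample values invented) shows why the default dtype keeps the 2-D shape:

```python
import io
import numpy as np

sample = io.StringIO(
    "col0;col1;col2;col3;col4;col5;col6;col7;col8\n"
    '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'
    '35;"high.school";"no";100;2;93.200;-40.0;1.313;"yes"\n'
)

# With the default dtype, every column is coerced to float64, so the
# result is an ordinary 2-D array -- but the string columns silently
# become nan.
data2 = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(data2.shape)  # (2, 9)
print(data2[0])     # [47. nan nan 176. 1. 93.994 -36.4 4.857 nan]
```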
Secondly, with dtype=None I get the following deprecation warning:
VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
which I can apparently fix by setting (is this correct?):
encoding='ASCII'
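For reference, here is a sketch of what I mean, combining the encoding argument with an explicit per-column dtype (the field names, string widths, and second sample row are my own guesses, not from the real file):

```python
import io
import numpy as np

sample = io.StringIO(
    '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'
    '35;"high.school";"no";100;2;93.200;-40.0;1.313;"yes"\n'
)

# Hypothetical column names and types -- adjust to the real file.
col_types = np.dtype([
    ("age", "i8"), ("education", "U40"), ("default", "U5"),
    ("duration", "i8"), ("campaign", "i8"), ("idx", "f8"),
    ("conf", "f8"), ("rate", "f8"), ("y", "U5"),
])

# genfromtxt does not strip the double quotes, so e.g. the education
# field comes back as '"university.degree"'.
data = np.genfromtxt(sample, dtype=col_types, delimiter=";",
                     encoding="ascii")

print(data.shape)   # (2,) -- a structured array, one record per row
print(data["age"])  # [47 35]
```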
I have two questions:
- How can I set the correct type for each column?
- Why do I have to set the encoding?