3

I have a Python project from kaggle.com, and I am having problems reading in the data set. It consists of one CSV file. I need to read it in and put the target and training parts of it into arrays.

Here are the first 3 rows of the data set (the target column is the 19th column and the features are the first 18 columns):

user    gender  age how_tall_in_meters  weight  body_mass_index x1  
debora  Woman   46  1.62    75  28.6    -3  
debora  Woman   46  1.62    75  28.6    -3  

The target column, which is not shown here, has string values.

from pandas import read_csv
import numpy as np
from sklearn.linear_model.stochastic_gradient import SGDClassifier
from sklearn import preprocessing
import sklearn.metrics as metrics
from sklearn.cross_validation import train_test_split

#d = pd.read_csv("data.csv", dtype={'A': np.str(), 'B': np.str(), 'S': np.str()})

dataset = np.genfromtxt(open('data.csv','r'), delimiter=',', dtype='f8')[1:]
target = np.array([x[19] for x in dataset])
train = np.array([x[1:] for x in dataset])

print(target)

The error I'm getting is:

Traceback (most recent call last):
  File "C:\Users\Cameron\Desktop\Project - Machine learning\datafilesforproj\SGD_classifier.py", line 12, in <module>
    dataset = np.genfromtxt(open('data.csv','r'), delimiter=',', dtype='f8')[1:]
  File "C:\Python33\lib\site-packages\numpy\lib\npyio.py", line 1380, in genfromtxt
    first_values = split_line(first_line)
  File "C:\Python33\lib\site-packages\numpy\lib\_iotools.py", line 217, in _delimited_splitter
    line = line.split(self.comments)[0]
TypeError: Can't convert 'bytes' object to str implicitly
Guillaume Jacquenot
user3451169
  • Please take some time to format the code properly. – Hooked Apr 27 '14 at 04:06
  • possible duplicate of [Python3 Error: TypeError: Can't convert 'bytes' object to str implicitly](http://stackoverflow.com/questions/16699362/python3-error-typeerror-cant-convert-bytes-object-to-str-implicitly) – CodeManX Apr 27 '14 at 04:45
  • Also, try to come up with a [minimal working example](http://stackoverflow.com/help/mcve). Your problem has nothing to do with Kaggle.com. And you should try to cut your program down to 2 lines to isolate the problem. Also, create an extremely simple .csv file to practice with. – Garrett Oct 02 '14 at 08:36
  • @CoDEmanX No it's not. This is numpy-specific, cf. my answer. – smheidrich Feb 27 '17 at 13:11

5 Answers

4

What worked for me was changing the line

dataset = np.genfromtxt(open('data.csv','r'), delimiter=',', dtype='f8')[1:]

to

dataset = np.genfromtxt('data.csv', delimiter=',', dtype='f8')[1:]

(unfortunately, I'm not quite sure what the underlying problem was)
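One plausible reading, consistent with the traceback in the question: the NumPy version involved expected bytes internally, while a file opened with 'r' in Python 3 yields str. Passing the path lets NumPy open the file itself. Below is a minimal sketch of that approach, assuming a small hypothetical three-column file (the column names `user`, `age` and `label` are made up for illustration) and a NumPy recent enough to accept the `encoding` argument:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for data.csv: a header row plus two data rows.
csv_text = "user,age,label\ndebora,46,sitting\ndebora,46,standing\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(csv_text)
    path = tmp.name

try:
    # Pass the path rather than an open text-mode handle.
    # names=True takes field names from the header row;
    # dtype=None lets NumPy infer a type per column, so string columns survive.
    data = np.genfromtxt(path, delimiter=",", names=True,
                         dtype=None, encoding="utf-8")
finally:
    os.remove(path)

target = data["label"]  # string column
ages = data["age"]      # integer column
print(target, ages)
```

Using `names=True` here replaces the `[1:]` slice for skipping the header, and `dtype=None` avoids string columns being read as nan, which is what `dtype='f8'` would produce for them.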

Garrett
3

This is in fact a bug in numpy, cf. issue #3184.

I'll just copy the workaround that I presented over there:

import functools
import io
import numpy as np
import sys

genfromtxt_old = np.genfromtxt
@functools.wraps(genfromtxt_old)
def genfromtxt_py3_fixed(f, encoding="utf-8", *args, **kwargs):
  if isinstance(f, io.TextIOBase):
    if hasattr(f, "buffer") and hasattr(f.buffer, "raw") and \
    isinstance(f.buffer.raw, io.FileIO):
      # Best case: get underlying FileIO stream (binary!) and use that
      fb = f.buffer.raw
      # Reset cursor on the underlying object to match that on wrapper
      fb.seek(f.tell())
      result = genfromtxt_old(fb, *args, **kwargs)
      # Reset cursor on wrapper to match that of the underlying object
      f.seek(fb.tell())
    else:
      # Not very good but works: Put entire contents into BytesIO object,
      # otherwise same ideas as above
      old_cursor_pos = f.tell()
      fb = io.BytesIO(bytes(f.read(), encoding=encoding))
      result = genfromtxt_old(fb, *args, **kwargs)
      f.seek(old_cursor_pos + fb.tell())
  else:
    result = genfromtxt_old(f, *args, **kwargs)
  return result

if sys.version_info >= (3,):
  np.genfromtxt = genfromtxt_py3_fixed

After putting this at the top of your code, you can just use np.genfromtxt again and it should work fine in Python 3.
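The core idea of the wrapper, reaching the binary stream underneath a text-mode file object, can be tried in isolation. This is only a sketch under stated assumptions: the temporary file and its contents are hypothetical, and it relies on the handle returned by `open(path, "r")` exposing `.buffer.raw`, which the wrapper above checks for explicitly:

```python
import io
import os
import tempfile

import numpy as np

# Hypothetical numeric-only CSV used purely for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1.0,2.0\n3.0,4.0\n")
    path = tmp.name

try:
    with open(path, "r") as f:
        # f is a text wrapper; f.buffer.raw is the underlying binary FileIO.
        assert isinstance(f, io.TextIOBase)
        binary_stream = f.buffer.raw
        # Nothing has been read through f yet, so both cursors are at 0.
        arr = np.genfromtxt(binary_stream, delimiter=",")
finally:
    os.remove(path)

print(arr)
```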

smheidrich
1

As per https://mail.python.org/pipermail/python-list/2012-April/622487.html you probably need

import io
import numpy as np

# Open in binary mode so that genfromtxt receives bytes rather than str
inpstream = io.open('data.csv', 'rb')
dataset = np.genfromtxt(inpstream, delimiter=',', dtype='f8')[1:]

In the examples shown in http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html the objects used as file are of class StringIO. Nevertheless, from the specification of the function, I guess that passing the file name should work.
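To experiment with the binary-input idea without touching the real file, an in-memory bytes stream can stand in for a file opened with 'rb'. A minimal sketch, assuming made-up numeric data (the header names `a`, `b`, `c` are hypothetical):

```python
import io

import numpy as np

# In-memory stand-in for a file opened in binary mode.
raw = b"a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n"
stream = io.BytesIO(raw)

# skip_header=1 drops the header row, so no [1:] slice is needed afterwards.
arr = np.genfromtxt(stream, delimiter=",", skip_header=1, dtype="f8")
print(arr)
```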

0

You need to decode the bytes object to a str object explicitly, as the TypeError implies.

# For instance, interpret as UTF-8 (depends on your source)
self.comments = self.comments.decode('utf-8')
CodeManX
  • 3
    The code in question is inside the numpy library. Are you suggesting it's a numpy bug? If so, please edit your answer to clarify this. – max Aug 22 '15 at 23:25
  • Agreed, I am running the first example from here, and that's the same error: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html – andrea Mar 19 '16 at 04:40
-1

Instead of:

dataset = np.genfromtxt(open('data.csv','r'), delimiter=',', dtype='f8')[1:]

try this:

dataset = np.genfromtxt('C:\\..\\..\\train.csv', delimiter=',', dtype=None)[1:]

Note that each backslash in the path has to be escaped with an extra '\'.

djikay