5

I'm trying to use genfromtxt with Python3 to read a simple csv file containing strings and numbers. For example, something like (hereinafter "test.csv"):

1,a
2,b
3,c

With Python 2, the following works well:

import numpy
data=numpy.genfromtxt("test.csv", delimiter=",", dtype=None)
# Now data is something like [(1, 'a') (2, 'b') (3, 'c')]

In Python 3, the same code returns [(1, b'a') (2, b'b') (3, b'c')]. This is somewhat expected, given the different way Python 3 reads files. I therefore use a converter to decode the strings:

decodef = lambda x: x.decode("utf-8")
data=numpy.genfromtxt("test.csv", delimiter=",", dtype="f8,S8", converters={1: decodef})

This works with Python 2, but not with Python 3 (same [(1, b'a') (2, b'b') (3, b'c')] output). However, if in Python 3 I use the code above to read only one column:

data=numpy.genfromtxt("test.csv", delimiter=",", usecols=(1,), dtype="S8", converters={1: decodef})

the output strings are ['a' 'b' 'c'], already decoded as expected.

I've also tried passing a file object opened in 'rb' mode, as suggested at this link, but it made no difference.

Why does the converter work when only one column is read, but not when two columns are read? Could you please suggest the correct way to use genfromtxt in Python 3? Am I doing something wrong? Thank you in advance!

Alessandro
  • What's the question here? – wim May 16 '13 at 07:27
  • @wim Edited. Now the question should look more clear. – Alessandro May 16 '13 at 07:38
  • Same issue here. I was very confused at first by the bytes b' literal appearing instead of the expected string. I had a csv file with 2 columns: sentiment, with a value of 0 or 1, and text (UTF-16). Handling the columns separately using this decoding approach was viable. – chri3g91 Apr 28 '19 at 09:03

3 Answers

9

The answer to my problem is to use the dtype for unicode strings (U2, for example).

Thanks to E. Kehler's answer, I found the solution. If I use str in place of S8 in the dtype definition, the output for the 2nd column is empty:

numpy.genfromtxt("test.csv", delimiter=",", dtype='f8,str')

the output is:

array([(1.0, ''), (2.0, ''), (3.0, '')], dtype=[('f0', '<f8'), ('f1', '<U0')])

This suggested to me that the correct dtype for my problem is a unicode string:

numpy.genfromtxt("test.csv", delimiter=",", dtype='f8,U2')

that gives the expected output:

array([(1.0, 'a'), (2.0, 'b'), (3.0, 'c')], dtype=[('f0', '<f8'), ('f1', '<U2')])
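
A self-contained sketch of the same call, for anyone who wants to try it without a file on disk. It assumes NumPy 1.14 or later (where genfromtxt accepts an encoding argument) and uses an in-memory io.StringIO as a stand-in for test.csv; the variable names are only illustrative.

```python
import io

import numpy as np

# In-memory stand-in for test.csv (an assumption, to avoid touching the disk).
csv_file = io.StringIO("1,a\n2,b\n3,c\n")

# 'f8' reads column 0 as a 64-bit float; 'U2' reads column 1 as a unicode
# string of at most 2 characters, so no b'...' prefix appears.
# encoding="utf-8" is passed explicitly so the lines are handled as text.
data = np.genfromtxt(csv_file, delimiter=",", dtype="f8,U2", encoding="utf-8")

# With no names given, the fields get the default names f0, f1, ...
numbers = data["f0"]
letters = data["f1"]
print(numbers, letters)
```

Keep in mind that a fixed-width unicode dtype should be wide enough for the longest value: with U2, a string such as "abc" would be expected to be cut to "ab".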

Useful information can also be found on the numpy datatype doc page.

Alessandro
1

In Python 3, writing

dtype="S8"

(or any variation of "S#") in NumPy's genfromtxt yields a byte string. To avoid this and get a plain old-fashioned string, write

dtype=str

instead.
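
To make the difference between the two kinds of string dtype concrete, here is a small sketch that does not involve genfromtxt at all. It only shows how "S" (bytes) and "U" (unicode) arrays behave in Python 3, and that an existing "S" array can be decoded afterwards with np.char.decode; the array names are made up for illustration.

```python
import numpy as np

# Kind 'S': fixed-width byte strings. Kind 'U': fixed-width unicode strings.
byte_strings = np.array(["a", "b", "c"], dtype="S8")
text_strings = np.array(["a", "b", "c"], dtype="U8")

# In Python 3, elements of the 'S' array print with the b'...' prefix.
print(byte_strings[0], text_strings[0])

# An 'S' array that has already been read can be decoded into a 'U' array.
decoded = np.char.decode(byte_strings, "utf-8")
print(decoded.dtype.kind, decoded)
```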

E. Kehler
  • Thank you for your answer. However, it didn't solve my problem, because with `str` the call `numpy.genfromtxt("test.csv", delimiter=",", dtype='f8,str')` gives an empty string for the data in the 2nd column (and a `dtype` ` – Alessandro Jul 12 '13 at 00:32
0
training = np.genfromtxt('twitter_train.csv', delimiter=',', usecols=(0,1), dtype='U')

In my case, the first column contains a sentiment value of either 0 or 1, and the second column is a string of many characters representing a tweet. In this example, dtype='U' removed the b' prefix from the output.

So in your case it would be: data=numpy.genfromtxt("test.csv", delimiter=",", dtype='U')
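
If the numeric column should stay numeric instead of being read as text, one alternative not covered above may be the encoding argument of genfromtxt. This is only a sketch: it assumes NumPy 1.14 or later, where that argument exists, and uses an in-memory io.StringIO in place of test.csv.

```python
import io

import numpy as np

# In-memory stand-in for test.csv (an assumption, to keep the example runnable).
csv_file = io.StringIO("1,a\n2,b\n3,c\n")

# dtype=None lets genfromtxt infer a type per column; an explicit text
# encoding makes string columns come back as str rather than bytes.
data = np.genfromtxt(csv_file, delimiter=",", dtype=None, encoding="utf-8")

print(data["f0"], data["f1"])
print(data.dtype)
```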

chri3g91