
I've got an ndarray which I am trying to read from a CSV file. I can read the file via numpy, but I can't get the structure I want: instead of a 2D array, I get a 1D array of tuples.

As an MCVE: instead of a 2D array like DataSet1, I get DataSet2:

import numpy as np

dataset = np.array([
        ["abc ", 3000.0, 1],
        ["def", 3650.0, 1],
        ["xyz", 3000.0, 2]
        ])
print("DataSet1\n", dataset)
print("DataSet1-Shape\n", dataset.shape)


dataset2 = np.array([])

dataset2 = np.genfromtxt('file.csv', delimiter=",", dtype='S32,float,int')

print("DataSet2\n",dataset2)
print("DataSet2-Shape\n",dataset2.shape)

The output is:

DataSet1
 [['abc ' '3000.0' '1']
 ['def' '3650.0' '1']
 ['xyz' '3000.0' '2']]
DataSet1-Shape
 (3, 3)
DataSet2
 [(b'"fabc"', 3000.0, 1) (b'"fdef"', 3650.0, 1) (b'"ghi"', 3000.0, 2)]
DataSet2-Shape
 (3,)

I want DataSet2 to be 2D like DataSet1.

CSV file contents:

"fabc",3000.0,1
"fdef",3650.0,1
"ghi",3000.0,2
P. Camilleri
user914584
2 Answers


Using a list comprehension and casting tuples to lists with np.array([list(tup) for tup in dataset2]) should work:

>>> np.array([list(tup) for tup in dataset2])
array([['"fabc"', '3000.0', '1'],
       ['"fdef"', '3650.0', '1'],
       ['"ghi"', '3000.0', '2']], 
      dtype='|S6')
>>> np.array([list(tup) for tup in dataset2]).shape
(3, 3)

Also note that your dataset2 = np.array([]) is useless, because dataset2 is overwritten on the next line. Edit: [list(tup) for tup in dataset2] is equivalent to list(map(list, dataset2))

For mixed types in numpy arrays, see Store different datatypes in one NumPy array?; I suggest you use a pandas.DataFrame instead.
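As a hedged sketch of that DataFrame route (reading the question's CSV contents from an in-memory buffer here so the snippet is self-contained; with the real file you would pass 'file.csv' to read_csv):

```python
import io

import pandas as pd

# Same contents as the question's CSV; a StringIO stands in for 'file.csv'.
csv_text = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# read_csv strips the double quotes and infers a dtype per column,
# so the strings stay strings and the numbers stay numeric.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.dtypes)   # object, float64, int64
print(df.shape)    # (3, 3)

# A plain 2D string array, comparable to DataSet1, if still needed:
arr = df.to_numpy(dtype=str)
print(arr.shape)   # (3, 3)
```

Unlike a single numpy array, the DataFrame keeps a separate dtype per column, so the numeric columns remain usable as numbers.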

P. Camilleri
  • Almost works... except each field value is now a string: [[b'"fabc"' b'3000.0' b'1'] [b'"fdef"' b'3650.0' b'1'] [b'"ghi"' b'3000.0' b'2']] – user914584 Oct 05 '16 at 11:30
  • numpy arrays can have only one type, I think. You can use a pandas.DataFrame if you want mixed type (just do df=pd.DataFrame(your_array)) – P. Camilleri Oct 05 '16 at 11:41
  • `dataset2.tolist()` works just as well as your list comprehension. `np.array` treats the tuples just like lists - unless given a compound dtype. – hpaulj Oct 05 '16 at 15:53

Your compound dtype loaded the file as a 1d array with 3 fields:

In [195]: data=np.genfromtxt('stack39872346.txt',delimiter=',',dtype='S32,float,int')
In [196]: data
Out[196]: 
array([(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1),
       (b'"ghi"', 3000.0, 2)], 
      dtype=[('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])
In [197]: data.shape
Out[197]: (3,)
In [198]: data.dtype
Out[198]: dtype([('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])

Your Dataset1 is 2d with string dtype:

In [207]: Dataset1
Out[207]: 
array([['abc ', '3000.0', '1'],
       ['def', '3650.0', '1'],
       ['xyz', '3000.0', '2']], 
      dtype='<U6')

Converting a compound dtype to a simple one is a little tricky. It can be done with astype, but perhaps it is simpler to use the list version of the data as an intermediary:

In [203]: data.tolist()
Out[203]: [(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1), (b'"ghi"', 3000.0, 2)]
In [204]: np.array(data.tolist())
Out[204]: 
array([[b'"fabc"', b'3000.0', b'1'],
       [b'"fdef"', b'3650.0', b'1'],
       [b'"ghi"', b'3000.0', b'2']], 
      dtype='|S6')

np.array has read the list of tuples and created a 2d array with the most common type, S6 (a Py3 bytestring).

Now it is easy to convert to unicode string with astype:

In [205]: np.array(data.tolist()).astype("U6")
Out[205]: 
array([['"fabc"', '3000.0', '1'],
       ['"fdef"', '3650.0', '1'],
       ['"ghi"', '3000.0', '2']], 
      dtype='<U6')

This is similar to Dataset1, except that the first column is double quoted.

I could have skipped the last astype by specifying the dtype up front: np.array(data.tolist(), dtype=str)

Better yet, tell genfromtxt that directly:

np.genfromtxt('stack39872346.txt',delimiter=',',dtype=str)

A nice thing about the original compound dtype is that you can access the numeric fields as numbers:

In [214]: data['f1']
Out[214]: array([ 3000.,  3650.,  3000.])
In [215]: Dataset1[:,1]
Out[215]: 
array(['3000.0', '3650.0', '3000.0'], 
      dtype='<U6')

I haven't addressed the double quotes. The csv reader can strip those; genfromtxt does not. Fortunately you don't have delimiters within the quotes, so I can write a converter that strips them off during the genfromtxt read.
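That csv-reader route can be sketched with the stdlib csv module (again reading the file contents from an in-memory buffer so the snippet stands alone):

```python
import csv
import io

import numpy as np

# Same contents as stack39872346.txt; a StringIO stands in for the file.
text = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# csv.reader honors the default quotechar '"', so the quotes are
# stripped before numpy ever sees the values.
rows = list(csv.reader(io.StringIO(text)))
arr = np.array(rows)     # plain 2D unicode array
print(arr)
print(arr.shape)         # (3, 3)
```

The trade-off is that every column comes back as a string, like the dtype=str genfromtxt call above.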

=================

def foo(astr):
    return astr[1:-1] # crude dequote

In [223]: data=np.genfromtxt('stack39872346.txt',delimiter=',',
     dtype='U6,float,int', converters={0:foo})
In [224]: data
Out[224]: 
array([('fabc', 3000.0, 1), 
       ('fdef', 3650.0, 1), 
       ('ghi', 3000.0, 2)], 
      dtype=[('f0', '<U6'), ('f1', '<f8'), ('f2', '<i4')])

In [225]: np.array(data.tolist())
Out[225]: 
array([['fabc', '3000.0', '1'],
       ['fdef', '3650.0', '1'],
       ['ghi', '3000.0', '2']], 
      dtype='<U6')

It looks like I have to use a compound dtype when loading with a converter.

hpaulj