Your compound dtype loaded the file as a 1d array with 3 fields:
In [195]: data=np.genfromtxt('stack39872346.txt',delimiter=',',dtype='S32,float,int')
In [196]: data
Out[196]:
array([(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1),
       (b'"ghi"', 3000.0, 2)],
      dtype=[('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])
In [197]: data.shape
Out[197]: (3,)
In [198]: data.dtype
Out[198]: dtype([('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])
Your Dataset1 is 2d with string dtype:
In [207]: Dataset1
Out[207]:
array([['abc ', '3000.0', '1'],
       ['def', '3650.0', '1'],
       ['xyz', '3000.0', '2']],
      dtype='<U6')
Converting a compound dtype to a simple one is a little tricky. It can be done with astype. But perhaps it is simpler to use the list version of data as the intermediary.
In [203]: data.tolist()
Out[203]: [(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1), (b'"ghi"', 3000.0, 2)]
In [204]: np.array(data.tolist())
Out[204]:
array([[b'"fabc"', b'3000.0', b'1'],
       [b'"fdef"', b'3650.0', b'1'],
       [b'"ghi"', b'3000.0', b'2']],
      dtype='|S6')
np.array has read the list of tuples and created a 2d array with a common dtype that can hold every element, S6 (a Py3 bytestring). Now it is easy to convert to a unicode string with astype:
In [205]: np.array(data.tolist()).astype("U6")
Out[205]:
array([['"fabc"', '3000.0', '1'],
       ['"fdef"', '3650.0', '1'],
       ['"ghi"', '3000.0', '2']],
      dtype='<U6')
This is similar to Dataset1, except that the first column is double quoted.
I could skip the last astype by specifying dtype: np.array(data.tolist(),dtype=str)
Better yet, tell that to genfromtxt:
np.genfromtxt('stack39872346.txt',delimiter=',',dtype=str)
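To make the tolist route easy to try without the file, here is a minimal sketch that builds a small structured array by hand. The values mirror the output shown above; the variable names `structured` and `flat` are only for illustration, and the first field is stored as unicode rather than bytes to keep the example short.

```python
import numpy as np

# Hand-built stand-in for the array that genfromtxt returned
# (assumption: unicode first field instead of the S32 bytes field).
structured = np.array(
    [('"fabc"', 3000.0, 1), ('"fdef"', 3650.0, 1), ('"ghi"', 3000.0, 2)],
    dtype='U6,f8,i8',
)

# tolist() gives a list of plain tuples; np.array then treats each
# tuple as a row, and dtype=str converts every element to a string.
flat = np.array(structured.tolist(), dtype=str)

print(structured.shape)  # 1d: one record per row
print(flat.shape)        # 2d: rows by columns
print(flat)
```

The structured array stays 1d with named fields, while `flat` is an ordinary 2d string array that can be indexed by column.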
A nice thing about the original compound dtype is that you can access the numeric fields as numbers, whereas the columns of Dataset1 are strings:
In [214]: data['f1']
Out[214]: array([ 3000., 3650., 3000.])
In [215]: Dataset1[:,1]
Out[215]:
array(['3000.0', '3650.0', '3000.0'],
      dtype='<U6')
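The practical difference shows up as soon as you do arithmetic. The sketch below reads the same text twice from an in-memory buffer, once with a compound dtype and once with dtype=str. `SAMPLE` is my reconstruction of the file contents from the output above, so treat it as an assumption; `encoding='utf-8'` is passed so that the fields arrive as str rather than bytes.

```python
import io
import numpy as np

# Assumed file contents, reconstructed from the arrays shown above.
SAMPLE = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# Compound dtype: each field keeps its own type.
data = np.genfromtxt(io.StringIO(SAMPLE), delimiter=',',
                     dtype='U6,float,int', encoding='utf-8')

# Plain string dtype: a 2d array in which everything is text.
as_text = np.genfromtxt(io.StringIO(SAMPLE), delimiter=',',
                        dtype=str, encoding='utf-8')

# The numeric field can be used directly ...
mean_from_fields = data['f1'].mean()

# ... while the string column needs an explicit conversion first.
mean_from_text = as_text[:, 1].astype(float).mean()

print(mean_from_fields, mean_from_text)
```

Both means agree; the string version just needs the extra astype(float) step every time a column is used numerically.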
I haven't addressed the double quotes. The csv reader can strip those; genfromtxt does not. Fortunately you don't have delimiters within the quotes, so I could write a converter that would strip them off during the genfromtxt read.
=================
def foo(astr):
    return astr[1:-1] # crude dequote
In [223]: data=np.genfromtxt('stack39872346.txt',delimiter=',',
                             dtype='U6,float,int', converters={0:foo})
In [224]: data
Out[224]:
array([('fabc', 3000.0, 1),
       ('fdef', 3650.0, 1),
       ('ghi', 3000.0, 2)],
      dtype=[('f0', '<U6'), ('f1', '<f8'), ('f2', '<i4')])
In [225]: np.array(data.tolist())
Out[225]:
array([['fabc', '3000.0', '1'],
       ['fdef', '3650.0', '1'],
       ['ghi', '3000.0', '2']],
      dtype='<U6')
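The slicing in foo assumes every value is wrapped in exactly one pair of quotes. A slightly more defensive variant, sketched here against an in-memory copy of the data, strips quote characters only where they are present and copes with either bytes or str input. The name `dequote` and the `SAMPLE` text are my own assumptions, not part of the question.

```python
import io
import numpy as np

# Assumed file contents, reconstructed from the arrays shown above.
SAMPLE = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

def dequote(field):
    # Depending on the numpy version and the encoding argument, the
    # converter may receive bytes or str, so normalise to str first.
    if isinstance(field, bytes):
        field = field.decode()
    # strip('"') removes quotes only if they are there, so unquoted
    # values pass through unchanged.
    return field.strip('"')

data = np.genfromtxt(io.StringIO(SAMPLE), delimiter=',',
                     dtype='U6,float,int', encoding='utf-8',
                     converters={0: dequote})

print(data['f0'])
print(np.array(data.tolist(), dtype=str))
```

Like foo, this does not handle delimiters inside the quotes; it only removes the quote characters themselves.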
It looks like I have to use a compound dtype when loading with a converter.
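If a 2d string array like Dataset1 is all you need, the csv reader mentioned earlier is an alternative to the converter, since it removes the quotes while parsing. This is a minimal sketch using the standard library csv module on an in-memory copy of the data; `SAMPLE` is again my reconstruction of the file contents.

```python
import csv
import io
import numpy as np

# Assumed file contents, reconstructed from the arrays shown above.
SAMPLE = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# csv.reader treats the double quote as its default quotechar, so the
# quotes are dropped and each row comes back as a list of strings.
rows = list(csv.reader(io.StringIO(SAMPLE)))

# A list of equal-length string lists becomes a 2d unicode array.
dataset = np.array(rows)

# Columns are still text; convert explicitly when numbers are needed.
prices = dataset[:, 1].astype(float)

print(dataset)
print(prices)
```

Unlike the slicing converter, csv.reader also copes with delimiters inside quoted values, at the cost of giving up the per-field types of the compound dtype.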