Issues importing datasets (txt file) with Python using numpy library genfromtxt function

Question

I am trying to learn Python, but I can't get a dataset to import correctly.

The dataset contains 16 columns and 16,320 rows, saved as a txt file. I used the genfromtxt function as follows:

import numpy as np  
dt=np.dtype([('name', np.str_, 16),('platform', np.str_, 16),('year', np.float_, (2,)),('genre', np.str_, 16),('publisher', np.str_, 16),('na_sales', np.float_, (2,)), ('eu_sales', np.float64, (2,)), ('jp_sales', np.float64, (2,)), ('other_sales', np.float64, (2,)), ('global_sales', np.float64, (2,)), ('critic_scores', np.float64, (2,)),('critic_count', np.float64, (2,)),('user_scores', np.float64, (2,)),('user_count', np.float64, (2,)),('developer', np.str_, 16),('rating', np.str_, 16)])  
data=np.genfromtxt('D:\\data3.txt',delimiter=',',names=True,dtype=dt)

I get this error:

ValueError: size of tuple must match number of fields.

But my dt variable contains 16 types, one for each column. I specify the datatypes because otherwise the strings are replaced by nan.

Any help would be appreciated.

Suggestion: post a few of the first lines from your data3.txt file. Are you sure it has 16 columns? — payne, Mar 04 '17 at 14:15
Why all the `(2,)` in the dtype? You define 16 fields but all the floats are doubled. Have you tried a `dtype=None` load? That lets it deduce the best dtypes. — hpaulj, Mar 04 '17 at 14:57

hpaulj · Accepted Answer · 2017-03-04T17:24:19.963

Look at an array made with your dt:

In [78]: np.ones((1,),dt)
Out[78]: 
array([ ('1', '1', [ 1.,  1.], '1', '1', [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
      [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
      [ 1.,  1.], '1', '1')], 
      dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8', (2,)), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8', (2,)), ('eu_sales', '<f8', (2,)), ('jp_sales', '<f8', (2,)), ('other_sales', '<f8', (2,)), ('global_sales', '<f8', (2,)), ('critic_scores', '<f8', (2,)), ('critic_count', '<f8', (2,)), ('user_scores', '<f8', (2,)), ('user_count', '<f8', (2,)), ('developer', '<U16'), ('rating', '<U16')])

I count 26 1s (string and float), not the 16 you need: the 6 string fields hold one value each, but each of the 10 float fields holds two. Were you thinking the (2,) denoted a double? It denotes a 2-element subfield.
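A quick way to see the difference, as a minimal sketch on a cut-down two-field dtype (the names dt_sub and dt_flat are just for illustration, and 'U16'/'f8' are the same types that show up in the output above):

```python
import numpy as np

# 'year' declared with a (2,) shape: each record stores TWO floats in that field.
dt_sub = np.dtype([('name', 'U16'), ('year', 'f8', (2,))])

# 'year' declared without a shape: each record stores ONE float (already a double).
dt_flat = np.dtype([('name', 'U16'), ('year', 'f8')])

print(dt_sub['year'].shape)    # shape of the subarray inside the field
print(dt_flat['year'].shape)   # empty shape, i.e. a scalar field

# The extra float shows up as 8 more bytes per record.
print(dt_sub.itemsize - dt_flat.itemsize)
```

'f8' is a 64-bit float on its own, so no shape is needed to get double precision.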

Take out all those (2,):

In [80]: np.ones((1,),dt)
Out[80]: 
array([ ('1', '1',  1., '1', '1',  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., '1', '1')], 
      dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8'), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8'), ('eu_sales', '<f8'), ('jp_sales', '<f8'), ('other_sales', '<f8'), ('global_sales', '<f8'), ('critic_scores', '<f8'), ('critic_count', '<f8'), ('user_scores', '<f8'), ('user_count', '<f8'), ('developer', '<U16'), ('rating', '<U16')])

Now I have 16 fields that should parse your 16 columns just right.
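For reference, here is a sketch of the corrected call. I don't have your data3.txt, so the two rows below are made up and fed in through io.StringIO; the header uses the same names as the dtype, and encoding='utf-8' is my assumption about the file (it needs NumPy 1.14 or later).

```python
import io
import numpy as np

# Same 16 fields as in the question, with every (2,) removed.
dt = np.dtype([
    ('name', 'U16'), ('platform', 'U16'), ('year', 'f8'),
    ('genre', 'U16'), ('publisher', 'U16'),
    ('na_sales', 'f8'), ('eu_sales', 'f8'), ('jp_sales', 'f8'),
    ('other_sales', 'f8'), ('global_sales', 'f8'),
    ('critic_scores', 'f8'), ('critic_count', 'f8'),
    ('user_scores', 'f8'), ('user_count', 'f8'),
    ('developer', 'U16'), ('rating', 'U16'),
])

# Made-up stand-in for data3.txt: one header line plus two data rows.
sample = io.StringIO(
    "name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,"
    "other_sales,global_sales,critic_scores,critic_count,user_scores,"
    "user_count,developer,rating\n"
    "GameA,Wii,2006,Sports,PubA,41.36,28.96,3.77,8.45,82.53,76,51,8,322,DevA,E\n"
    "GameB,NES,1985,Platform,PubB,29.08,3.58,6.81,0.77,40.24,80,40,7,100,DevB,E\n"
)

data = np.genfromtxt(sample, delimiter=',', names=True, dtype=dt,
                     encoding='utf-8')

print(data.shape)        # one record per data row
print(data['name'])      # string column
print(data['year'])      # float column
```

With your real file, pass the path instead of sample. Note that 'U16' truncates any string longer than 16 characters.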

But often dtype=None works just as well. It lets genfromtxt deduce the best dtype for each field. In that case it will take field names from the column header line (your names=True parameter).
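A minimal sketch of the dtype=None route, again on made-up rows (only four columns here to keep it short). I'm passing encoding='utf-8' as an assumption about your file; with it the text columns should come back as str rather than bytes.

```python
import io
import numpy as np

# Made-up sample: header line plus two rows.
sample = io.StringIO(
    "name,platform,year,global_sales\n"
    "GameA,Wii,2006,82.53\n"
    "GameB,NES,1985,40.24\n"
)

# dtype=None: genfromtxt inspects each column and picks a type for it.
# names=True: field names are taken from the header line.
data = np.genfromtxt(sample, delimiter=',', names=True, dtype=None,
                     encoding='utf-8')

print(data.dtype.names)   # names taken from the header
print(data.dtype)         # deduced type of each field
```

Check the printed dtype before relying on it. A column that looks numeric but contains stray text will be deduced as a string.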

It's a good idea to test complicated lines of code before throwing them into larger scripts, especially if you are in the process of learning.
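One way to do that here, as a rough sketch: compare the header's column count with the dtype, then parse only the first few rows with max_rows before loading everything. The temporary file and its three columns are made up so the snippet runs on its own; point path at your own file instead.

```python
import os
import tempfile
import numpy as np

# Made-up stand-in for the real file: header plus 100 data rows.
rows = ["name,year,global_sales"]
rows += [f"Game{i},{2000 + i},{i * 1.5}" for i in range(100)]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("\n".join(rows) + "\n")
    path = f.name

dt = np.dtype([('name', 'U16'), ('year', 'f8'), ('global_sales', 'f8')])

# Step 1: does the header have as many columns as the dtype has fields?
with open(path) as f:
    header = f.readline().rstrip("\n").split(",")
print(len(header), len(dt.names))

# Step 2: parse only the first 5 data rows and look at the result.
head = np.genfromtxt(path, delimiter=',', names=True, dtype=dt,
                     max_rows=5, encoding='utf-8')
print(head)

os.remove(path)  # clean up the temporary file
```

If the small load looks right, drop max_rows and run it on the full file.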
