I am trying to learn Python, but I can't get a dataset to import correctly.

The dataset has 16 columns and 16,320 rows and is saved as a .txt file. I used the genfromtxt function as follows:

import numpy as np  
dt=np.dtype([('name', np.str_, 16),('platform', np.str_, 16),('year', np.float_, (2,)),('genre', np.str_, 16),('publisher', np.str_, 16),('na_sales', np.float_, (2,)), ('eu_sales', np.float64, (2,)), ('jp_sales', np.float64, (2,)), ('other_sales', np.float64, (2,)), ('global_sales', np.float64, (2,)), ('critic_scores', np.float64, (2,)),('critic_count', np.float64, (2,)),('user_scores', np.float64, (2,)),('user_count', np.float64, (2,)),('developer', np.str_, 16),('rating', np.str_, 16)])  
data=np.genfromtxt('D:\\data3.txt',delimiter=',',names=True,dtype=dt)

I get this error:

ValueError: size of tuple must match number of fields.

But my dt variable contains 16 types, one for each column. I specify the dtype because otherwise the strings are replaced by nan.

Any help would be appreciated.

  • Suggestion: post the first few lines of your data3.txt file. Are you sure it has 16 columns? – payne Mar 04 '17 at 14:15
  • Why all the `(2,)` in the dtype? You define 16 fields but all the floats are doubled. Have you tried a `dtype=None` load? That lets it deduce the best dtypes. – hpaulj Mar 04 '17 at 14:57

1 Answer

Look at an array made with your dt:

In [78]: np.ones((1,),dt)
Out[78]: 
array([ ('1', '1', [ 1.,  1.], '1', '1', [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
      [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
      [ 1.,  1.], '1', '1')], 
      dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8', (2,)), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8', (2,)), ('eu_sales', '<f8', (2,)), ('jp_sales', '<f8', (2,)), ('other_sales', '<f8', (2,)), ('global_sales', '<f8', (2,)), ('critic_scores', '<f8', (2,)), ('critic_count', '<f8', (2,)), ('user_scores', '<f8', (2,)), ('user_count', '<f8', (2,)), ('developer', '<U16'), ('rating', '<U16')])

I count 26 1s (6 strings plus 10 float fields holding 2 values each), not the 16 you need. Were you thinking the (2,) denoted a double? It denotes a 2-element subarray, so each of those fields expects two values.
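To see the effect of the shape argument in isolation, here is a minimal sketch using a cut-down, two-field version of the dtype (the names `doubled` and `plain` are only for illustration). It compares a field declared with `(2,)` against the same field without it:

```python
import numpy as np

# Two-field illustration only; the real dtype in the question has 16 fields.
doubled = np.dtype([('name', 'U16'), ('year', 'f8', (2,))])
plain = np.dtype([('name', 'U16'), ('year', 'f8')])

# The (2,) makes 'year' a sub-array of two floats rather than one float.
print(doubled['year'].shape)   # shape of the 'year' field with (2,)
print(plain['year'].shape)     # shape of the 'year' field without it

# The extra float also shows up in the record size (in bytes).
print(doubled.itemsize, plain.itemsize)
```

A field whose shape is `()` holds a single scalar, which is what a one-value-per-column text file needs.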

Take out all those (2,) and look again:

In [80]: np.ones((1,),dt)
Out[80]: 
array([ ('1', '1',  1., '1', '1',  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., '1', '1')], 
      dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8'), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8'), ('eu_sales', '<f8'), ('jp_sales', '<f8'), ('other_sales', '<f8'), ('global_sales', '<f8'), ('critic_scores', '<f8'), ('critic_count', '<f8'), ('user_scores', '<f8'), ('user_count', '<f8'), ('developer', '<U16'), ('rating', '<U16')])

Now I have 16 fields that should parse your 16 columns just right.
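As a rough check, the corrected dtype can be tried against a small in-memory sample before pointing it at the real file. The column names come from the question; the two data rows are made up for illustration, and the dtype is built with a list comprehension only to keep the example short:

```python
import io
import numpy as np

# Column names from the question; string columns are listed separately.
cols = ['name', 'platform', 'year', 'genre', 'publisher', 'na_sales',
        'eu_sales', 'jp_sales', 'other_sales', 'global_sales',
        'critic_scores', 'critic_count', 'user_scores', 'user_count',
        'developer', 'rating']
str_cols = {'name', 'platform', 'genre', 'publisher', 'developer', 'rating'}

# One scalar field per column: no (2,) shapes anywhere.
dt = np.dtype([(c, 'U16') if c in str_cols else (c, 'f8') for c in cols])

# Made-up sample standing in for data3.txt (header line plus two rows).
sample = io.StringIO(
    ','.join(cols) + '\n'
    'GameA,PlatX,2006,Sports,PubA,1.5,2.5,0.5,0.25,4.75,76,51,8,322,DevA,E\n'
    'GameB,PlatY,2008,Racing,PubB,3.0,1.0,0.5,0.5,5.0,82,73,8.3,709,DevB,E\n'
)

data = np.genfromtxt(sample, delimiter=',', names=True, dtype=dt)
print(data.shape)
print(data['name'], data['year'])
```

If this loads cleanly, replacing `sample` with the path to the real file should behave the same way, provided every row really has 16 comma-separated values.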

But often dtype=None works just as well. It lets genfromtxt deduce the best dtype for each field. In that case it will take field names from the column header line (your names=True parameter).
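A small sketch of the dtype=None route, again on made-up data with only three of the columns. The `encoding='utf-8'` argument is my addition: it asks genfromtxt for ordinary strings rather than bytes on NumPy versions where that is not the default.

```python
import io
import numpy as np

# Made-up three-column sample; the real file has 16 columns.
sample = io.StringIO(
    'name,year,na_sales\n'
    'GameA,2006,1.5\n'
    'GameB,2008,3.0\n'
)

# dtype=None lets genfromtxt pick a type per column;
# names=True takes the field names from the header line.
data = np.genfromtxt(sample, delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)
print(data.dtype)
```

Note that the deduced types may differ from a hand-written dtype. Here the year column contains only whole numbers, so it is read as an integer rather than a float.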

It's a good idea to test complicated lines of code before throwing them into larger scripts, especially if you are in the process of learning.

hpaulj