I am new to NumPy and I am trying to generate an array from a CSV file. I was told that np.genfromtxt works well for this and automatically detects and assigns dtypes. The function seemed to work flawlessly until I checked the shape of the array.

import numpy as np
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=",", dtype = None, names = True)

taxi.shape

[out]: (89560,)

I believe this shows that my dataset is now a 1D array. The tutorial I am working through in class ends with taxi.shape as (89560, 15), but it used a long, tedious for loop and then converted certain columns to floats. I want to try to learn a more efficient way.

The first few lines of the array are

array([(2016, 1, 1, 5, 0, 2, 4, 21.  , 2037,  52. , 0.8,  5.54, 11.65,  69.99, 1),
       (2016, 1, 1, 5, 0, 2, 1, 16.29, 1520,  45. , 1.3,  0.  ,  8.  ,  54.3 , 1),
       (2016, 1, 1, 5, 0, 2, 6, 12.7 , 1462,  36.5, 1.3,  0.  ,  0.  ,  37.8 , 2),
       (2016, 1, 1, 5, 0, 2, 6,  8.7 , 1210,  26. , 1.3,  0.  ,  5.46,  32.76, 1),
       (2016, 1, 1, 5, 0, 2, 6,  5.56,  759,  17.5, 1.3,  0.  ,  0.  ,  18.8 , 2),
       (2016, 1, 1, 5, 0, 4, 2, 21.45, 2004,  52. , 0.8,  0.  , 52.8 , 105.6 , 1),
       (2016, 1, 1, 5, 0, 2, 6,  8.45,  927,  24.5, 1.3,  0.  ,  6.45,  32.25, 1),
       (2016, 1, 1, 5, 0, 2, 6,  7.3 ,  731,  21.5, 1.3,  0.  ,  0.  ,  22.8 , 2),
       (2016, 1, 1, 5, 0, 2, 5, 36.3 , 2562, 109.5, 0.8, 11.08, 10.  , 131.38, 1),
       (2016, 1, 1, 5, 0, 6, 2, 12.46, 1351,  36. , 1.3,  0.  ,  0.  ,  37.3 , 2)],

So I can see from the results that each row has 15 comma-separated values (i.e. 15 columns), but the shape tells me that there are only 89560 rows and no columns. Am I reading this wrong? Is there a way to transform the shape of my taxi array to reflect the true number of columns (i.e. 15) as they are in the CSV file?

Any and all help is appreciated.

  • Look at the `dtype`. You have a structured array, with 89560 records, and 15 fields. You access fields by name. – hpaulj Apr 26 '20 at 20:24
  • `np.lib.recfunctions.structured_to_unstructured` can be used to convert a structured array to a 2d unstructured (all floats) one. – hpaulj Apr 26 '20 at 20:26
  • Thank you for the comments. How do I access a field by name? Is it similar to pandas, where I can call it directly from the array using the format taxi.name_of_column? Apologies, I am very new to this. – Oscar Agbor Apr 26 '20 at 20:33
  • `taxi['name_of_column']`. In pandas you can use the name as attribute or indexing key. With a structured array, you have to use the index syntax. If you are already used to using `pandas`, you might find it easier to use `pd.read_csv`. `to_numpy` or `to_records` can be used to create arrays. https://numpy.org/doc/stable/user/basics.rec.html – hpaulj Apr 26 '20 at 20:41
  • Your suggestion worked. Thank you, hpaulj. If I may ask one more question: I used the indexing format taxi['name_of_column'] and selected a few columns of the first 5 rows like so: ``` taxi_5 = taxi[:5] fare_comp = taxi_5['trip_length'][:], taxi_5['fare_amount'][:], taxi_5['fees_amount'][:] ``` I want to sum the values along both axes (0 and 1) using array.sum(axis=0) or array.sum(axis=1), but the error message tells me that the fare_comp object is a tuple and not an array. I tried the built-in sum(fare_comp), but it only adds the values in the column direction and not across rows. – Oscar Agbor Apr 27 '20 at 07:47
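
The tuple error in the last comment comes from the comma-separated expression, which builds a Python tuple of three 1-D arrays rather than one array. A minimal sketch of one way around it follows, assuming a small made-up structured array in place of `taxi` (the three field names come from the comment; the values are illustrative only): select the fields with a list of names, then convert the result to a plain 2-D array before summing.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Small stand-in for `taxi[:5]`; the real array comes from np.genfromtxt.
# The values below are made up for illustration.
taxi_5 = np.array(
    [(21.0, 52.0, 0.8), (16.29, 45.0, 1.3), (12.7, 36.5, 1.3)],
    dtype=[("trip_length", "f8"), ("fare_amount", "f8"), ("fees_amount", "f8")],
)

# A single field is an ordinary 1-D array.
fares = taxi_5["fare_amount"]

# Selecting several fields with a list still gives a structured array,
# so convert it to a plain 2-D float array before summing along an axis.
fare_comp = rfn.structured_to_unstructured(
    taxi_5[["trip_length", "fare_amount", "fees_amount"]]
)

col_totals = fare_comp.sum(axis=0)  # one total per column
row_totals = fare_comp.sum(axis=1)  # one total per row
```

Here `fare_comp` has shape (3, 3), so both `axis=0` and `axis=1` are valid.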

1 Answer

You can use this function to convert your structured array to an unstructured one with your desired data type (this assumes all fields can share one data type; if not, keeping the array structured is better):

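import numpy as np
import numpy.lib.recfunctions as rfn

# np.float was removed in NumPy 1.24; use np.float64 (or the builtin float) instead
taxi = rfn.structured_to_unstructured(taxi, dtype=np.float64)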
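
As a rough check of what the conversion does to the shape, here is a sketch on a tiny made-up structured array with mixed integer and float fields. The field names and values are hypothetical, not the actual headers of the CSV.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Hypothetical two-record array with mixed int and float fields.
records = np.array(
    [(2016, 1, 21.0), (2016, 1, 16.29)],
    dtype=[("year", "i8"), ("month", "i8"), ("distance", "f8")],
)

print(records.shape)        # (2,)  one entry per record
print(records.dtype.names)  # ('year', 'month', 'distance')

# Cast every field to float64 and lay the fields out as columns.
plain = rfn.structured_to_unstructured(records, dtype=np.float64)
print(plain.shape)          # (2, 3)
```

Applied to the full `taxi` array from the question, the same call should give the (89560, 15) shape the tutorial expects, with the integer columns cast to floats.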
Ehsan