I am using the Google Colab environment.

The file I am using is a CSV file and can be found here:

https://drive.google.com/open?id=1v7Mm6S8BVtou1iIfobY43LRF8MgGdjfU

Warning: it has several million rows.

This code runs within a minute in a Google Colab Python 3 notebook. I tried it several times with no problems.

from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' ,  dtype=int)

print(my_data[0:50])

The code below, on the other hand, runs for several minutes before disconnecting from Google Colab's server. I tried multiple times. Eventually Colab gives me a 'running out of memory' warning.

from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' ,  dtype=int,  names=True)

print(my_data[0:50])

It seems that there used to be an issue with names=True in Python 3, but that issue was fixed: https://github.com/numpy/numpy/issues/5411

I checked which version I was using in Colab, and it was up to date:

import numpy as np

print(np.version.version)

>1.14.3
SantoshGupta7
    The problem could be that your csv titles contain characters that numpy does not like. In my case it ignored the () in the title. price(USD) became priceUSD. You may want to check the names using `data = np.genfromtxt("ethereum.csv", delimiter=",", names=True) print(data.dtype)` which will print `('\ufeffdate', ' – Huhngut Jun 23 '22 at 08:01

1 Answer

With

my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' ,  dtype=int, max_rows=100)

I got a (100,4) int array.

With names=True it took a long time, and then issued a long list of errors, all the same except for the line number (even with max_rows):

Line #4121986 (got 4 columns instead of 3)

The header line is screwy - with an initial blank name:

In [753]: !head ../Downloads/refinedRatings.csv
,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
5,2,26,4
7,2,33,4
8,2,301,5
9,2,2686,5
10,2,3753,5

So based on names it thinks there are 3 columns, but all data lines have 4. Hence the error. I don't know why it ignores the max_rows in this case.
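The mismatch can be reproduced on a tiny in-memory sample, without loading the full file. This is a minimal sketch, assuming a header shaped like the one shown above; the `sample` string and the `message` variable are illustrative names, not part of the original data or code.

```python
from io import StringIO

import numpy as np

# Hypothetical three-line sample that mimics the file's header:
# the first column name is blank, so the header starts with a comma.
sample = ",user_id,book_id,rating\n0,1,258,5\n1,2,4081,4\n"

try:
    # names=True takes the column names from the header line.
    np.genfromtxt(StringIO(sample), delimiter=",", dtype=int, names=True)
    message = ""
except ValueError as err:
    # genfromtxt reports every data line whose column count
    # does not match the number of names it found.
    message = str(err)

print(message)
```

On the real file the same complaint would be collected for every one of the several million data lines, which may be why the names=True call is so slow and memory hungry there.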

But with my own names

In [755]: np.genfromtxt('../Downloads/refinedRatings.csv',delimiter=',',dtype=in
     ...: t, max_rows=10, names='foo,bar,dat,me')
Out[755]: 
array([(-1, -1,   -1, -1), ( 0,  1,  258,  5), ( 1,  2, 4081,  4),
       ( 2,  2,  260,  5), ( 3,  2, 9296,  5), ( 5,  2,   26,  4),
       ( 7,  2,   33,  4), ( 8,  2,  301,  5), ( 9,  2, 2686,  5),
       (10,  2, 3753,  5)],
      dtype=[('foo', '<i8'), ('bar', '<i8'), ('dat', '<i8'), ('me', '<i8')])

The first record (-1,-1,-1,-1) is the initial bad header line, with -1 in place of the strings it couldn't turn into ints. So we should use skip_header, as done below.

Alternatively:

In [756]: np.genfromtxt('../Downloads/refinedRatings.csv',delimiter=',',dtype=in
     ...: t, max_rows=10, skip_header=1)
Out[756]: 
array([[   0,    1,  258,    5],
       [   1,    2, 4081,    4],
       [   2,    2,  260,    5],
       [   3,    2, 9296,    5],
       [   5,    2,   26,    4],
       [   7,    2,   33,    4],
       [   8,    2,  301,    5],
       [   9,    2, 2686,    5],
       [  10,    2, 3753,    5],
       [  11,    2, 8519,    5]])

In sum, skip the header, and use your own names if you want a structured array.
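Both suggestions can be combined in one call. The following is a minimal sketch on an in-memory sample rather than the real file; the name `idx` for the unnamed first column is my own placeholder, and the `sample` string is illustrative.

```python
from io import StringIO

import numpy as np

# Hypothetical sample with the same layout as the file's first lines.
sample = ",user_id,book_id,rating\n0,1,258,5\n1,2,4081,4\n2,2,260,5\n"

data = np.genfromtxt(
    StringIO(sample),
    delimiter=",",
    dtype=int,
    skip_header=1,                        # drop the header with the blank first name
    names="idx,user_id,book_id,rating",   # supply all four names explicitly
)

print(data.dtype.names)   # field names of the structured array
print(data["book_id"])    # columns are then addressable by name
```

For the actual file, the `StringIO(sample)` argument would be replaced by the path to the CSV, optionally with max_rows while experimenting.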

hpaulj