
I'm having the following error:

"ParserError: Error tokenizing data. C error: out of memory"

This happens when I try to read a large CSV file (5 GB) into a dataframe. I am selecting only the columns that interest me and setting the parameters I thought were necessary, and even so it does not work. I've also tried the chunksize parameter.

```python
df = pd.read_csv('file.csv', encoding='ISO-8859-1', usecols=names_columns, low_memory=False, nrows=10000)
```

The strange thing is that when I set the parameter `nrows=1000` it works.

I've read files with many more rows than that and they worked perfectly, but this one gives this error.

Does anyone have any suggestions?

BrenoShelby
  • A DataFrame with 1000 rows, many columns, and large data types can be a larger object than a DataFrame with 10000 rows, a few columns, and small data types. Perhaps you would benefit from specifying `dtype`? (see the `dtype` argument in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)) – soyapencil Jan 20 '20 at 22:32
  • Hi and welcome to SO! Consider removing `low_memory=False` to prevent OOM errors. – hongsy Jan 21 '20 at 07:18

1 Answer


From this answer:

  1. There should not be a need to mess with `low_memory`. Remove that parameter.

  2. Specify dtypes (this should always be done).

Consider the example of a file which has a column called `user_id`. It contains 10 million rows where the `user_id` is always numbers. Adding `dtype={'user_id': int}` to the `pd.read_csv()` call will let pandas know, when it reads the file, that this column contains only integers.

hongsy