89

I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.

data = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)

The code either fails with a MemoryError, or just never finishes.

Memory usage in Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process I stopped it.

I am using pandas version 0.11.0.

I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.

The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).

It has also happened that I get a pop up telling me that it can't write to address 0x1e0baf93...

Stacktrace:

Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .

A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does

data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)

R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size I might be fighting a losing battle...

frnhr
Anne
  • 1
    Definitely pandas should not be having issues with csvs that size. Are you able to post this file online? – Andy Hayden Jul 09 '13 at 20:13
  • 1
    You can also try passing `nrows=something small` to `read_csv` to make sure it's not the *size* of the file causing problems, which as Andy said, shouldn't be the case. – TomAugspurger Jul 09 '13 at 20:16
  • 1
    it could be something to do with "Visual Studio, using Anaconda and PTVS"... maybe try in regular python too – Andy Hayden Jul 09 '13 at 20:18
  • The only way you can run in ``Visual Studio`` is if you compiled properly (with the correct ABI and such)...not trivial...are you win32 or amd64 (which is much harder to compile)? This should work just fine using the release binaries – Jeff Jul 09 '13 at 20:24
  • 1
    Thanks a lot for the comments. The code blows up with the same problem if I run it from the Python shell. I can get it to work if I cut down the lines (currently there are 900'000 lines, 71 columns). Turns out that even having only "empty" lines, i.e. ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,," will blow up Python, although the file size for this is fairly small (63 Mb). Is there a limit as to how big a dataframe can get? Maybe the number of lines x columns is a problem? – Anne Jul 10 '13 at 14:23
  • The more I am trying to figure out what is happening, the more confused I get. Turns out that `data = pandas.read_csv(filepath)` works, it is only when I add arguments such as `header = 1` that it starts failing. Although for delimiters other than "," I then get a parsing error: `CParserError: Error tokenizing data. C error: Expected 1 fields in line 95, saw 2`. – Anne Jul 10 '13 at 16:02
  • 3
    I have found the following to solve the problem: Read the csv as chunks `csv_chunks = pandas.read_csv(filepath, sep = DELIMITER,skiprows = 1, chunksize = 10000)`, then concatenate the chunks `df = pandas.concat(chunk for chunk in csv_chunks)`. I am still interested to know why reading it in one go doesn't work, to me this looks like an issue with the csv reader. – Anne Jul 10 '13 at 16:55
  • How much available memory on your machine? – Wes McKinney Jul 10 '13 at 18:38
  • Wes - According to task manager, physical memory seems 1.4 GB available (of a total 2.98 GB). Not sure how to check virtual memory though. – Anne Jul 10 '13 at 19:48
  • 11
    If anyone is still following this, I have a bit of an update. I have come to believe that the csv parser is fine (and very fast too), but there is a memory issue of some sort when creating data frames. The reason I believe this: when I use the `chunksize=1000` hack to read the csv and then try concatenating all the chunks into a big dataframe, it is at this point that memory blows up, with a memory footprint about 3-4x the size of the original file. Does anyone have an idea why the dataframe might blow up? – Anne Jul 12 '13 at 15:58
  • not really sure, but we had similar problems with a 32bit python distro, 64bit did it. – bmu Aug 27 '13 at 16:41
  • I have faced this issue when I tried read_csv against a 500 MB file, but then I could clearly see the available RAM running out in the Windows Task Manager. Do you see this happening when you use just the Python interpreter? Also, do give version 0.12 of Pandas a try. – fixxxer Nov 17 '13 at 17:11
  • I think that I am encountering a similar issue right now with the pandas data frame. I am loading 40 files with read_csv, each with 6M rows x 6 columns, on pandas version 0.10.1, albeit an older version. It keeps giving out-of-memory errors just from reading the csv files using read_csv. I am contemplating using fromtxt or genfromtxt to read the file in, then passing it to pandas – Paul Nov 19 '13 at 17:27
  • 2
    Actually discovered the limit was my problem. Now it is working. fromtxt is too slow for me. Now I am encountering different out of memory errors where a table (after merge) is getting incrementally bigger to about 15M rows. Might have to consider memmap and/or hdf5 – Paul Nov 19 '13 at 20:06
  • 1
    I have had similar problems with larger (>200MB) file sizes. There comes a point when the file is being read where RAM usage blows up (from ~4GB to 12GB) even when reading in by chunks. My solution was to read in file line by line to filter out unnecessary rows and then dump into dataframe, though this is certainly sub-optimal. – tnknepp Dec 09 '13 at 15:23
  • Note that np.empty is a call to numpy, the n-dimensional array library used by pandas. If you want to get some idea of how much memory is being allocated there, step through the code and determine what the shape and dtype values are. http://docs.scipy.org/doc/numpy-1.3.x/reference/generated/numpy.empty.html Numpy's empty method should allocate the memory but not initialize it. It's idle speculation, but empty has had issues with different versions of malloc in the past, https://github.com/numpy/numpy/issues/2179. Might be worthwhile checking if a newer version of numpy is available. – cs_alumnus Dec 21 '13 at 00:10
  • 1
    The speed/efficiency difference between R and SPlus is amazing, so I wouldn't be surprised if Pandas struggled with something like this. – Myles Baker Jan 15 '14 at 03:40
  • I had the same issue running on cmd. If I ran the same program on PowerShell, there's no more error. Were you running on cmd? – John Lin Oct 28 '22 at 07:21

8 Answers

34

Windows memory limitation

Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because a 32-bit process only gets 2 GB of memory to play with by default.

Tricks for lowering memory usage

If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading csv files, there is a trick.

The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.

How this works

By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.

Example

Let's say your csv looks like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01

This example is of course no problem to read into memory, but it's just an example.

If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.

I think the default in pandas is to read 1,000,000 rows before guessing the dtype.

Solution

Specifying dtype={'age':int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
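As a concrete illustration (not part of the original answer; the file name is hypothetical, and skipinitialspace just handles the space after each comma in the sample above), the call could look like this:

import pandas as pd

df = pd.read_csv(
    'people.csv',                  # hypothetical file name for the sample csv above
    skipinitialspace=True,         # strip the space that follows each comma
    dtype={'age': int},            # parse age directly as an integer, not as object
    parse_dates=['birthday'],      # optional: store birthday as datetime64 instead of object
)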

Problem with corrupt data

However, if your csv file is corrupted, like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz

Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
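If you cannot sanitize the file up front, one common alternative (my addition, not part of the original answer) is to read the suspect column as strings and coerce it afterwards, so bad values become NaN instead of raising an error:

import pandas as pd

df = pd.read_csv('people.csv', skipinitialspace=True, dtype={'age': str})
df['age'] = pd.to_numeric(df['age'], errors='coerce')   # "40+" becomes NaN instead of crashing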

Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:

Try it yourself

import resource               # Unix-only; ru_maxrss is the peak RSS (in KB on Linux)
import numpy as np
import pandas as pd

# floats stored as strings (object dtype)
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)

# the same values stored as float64 (measured in a fresh interpreter; ru_maxrss only grows)
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
firelynx
  • 1
    I can see how this can speed up reading the data, but decreasing memory? Surely it shouldn't need to store more than a few string values per column to guess the data type? I.e., unless you have a bazillion columns, or the `read_csv` function is doing something incredibly funky, I would be very surprised if the memory usage is noticeably higher. – Hannes Ovrén Jul 21 '15 at 14:40
  • 3
    @HannesOvrén The data type cannot be guessed before you have read a significant part of the data, otherwise you risk having to change it multiple times, which stacks the cost. I think pandas by default reads the first million rows before making the guess. I've brought down the memory profile of our pandas based product by 50fold by adding dtypes to csv loads. – firelynx Jul 21 '15 at 15:35
  • 1
    Hmm, thinking about it, I guess it could be problematic to decide if "3" is going to be a float or int unless you also see a "2.5" somewhere. Thanks for the explanation. I didn't know about this. – Hannes Ovrén Jul 21 '15 at 15:39
  • This is not true: with dtype it was both more expensive in memory and slower in time for me. Tested 6 times with dtype in read_csv; the averages are: memory no dtype: 12,121,429.333333334 | memory with dtype: 12,124,160.0. Tested 13 times for time; the averages are: time no dtypes: 2.0494697460761437 | time with dtypes: 2.100334332539485. Measured with: import os; import psutil; process = psutil.Process(os.getpid()); print(process.memory_info().rss). Data: 1.5 million rows from three separate datasets, 90% of the columns are object type. (Obviously a float takes less space than a string type.) – nikolaosmparoutis Aug 06 '20 at 11:47
  • @nikolaos_mparoutis Not sure how you came by these results. Maybe you want to write your own answer because it's hard to follow what is code and what is commentary in your comment. My answer is quite old, perhaps something changed. – firelynx Aug 07 '20 at 06:43
6

I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved it:

df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds

Spyder 3.2.3 Python 2.7.13 64bits

keineahnung2345
mooseman
4

I tried chunksize while reading a big CSV file:

reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)

The reader is now an iterable of chunks. We can iterate over it and write/append each chunk to a new csv, or perform any other operation:

for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
        print("Chunk appended to the file")
muTheTechie
2

I use Pandas on my Linux box and faced many memory leaks that only got resolved after upgrading Pandas to the latest version, cloned from GitHub.

Tarik
1

There is no error for Pandas 0.12.0 and NumPy 1.8.0.

I have managed to create a big DataFrame, save it to a csv file and then successfully read it back. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file, which took longer; to generate the 1.1 GB file, use a frequency of 30 seconds). Though I have 4 GB of RAM available.

My suggestion is to try updating Pandas. Another thing that could be useful is to run your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), hence it has more resources available.

Oleksandr
1

I encountered this issue as well when running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.

The only chance you have is what you already tried: chop the big thing down into smaller pieces which fit into memory.

If you ever asked yourself what MapReduce is all about, you just found it out by yourself: MapReduce would distribute the chunks over many machines, while you process the chunks on one machine, one after another.

What you found out with the concatenation of the chunks might indeed be an issue; maybe some copying is needed in that operation. In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again...

It could also be that pandas is smart enough to only load the individual data chunks into memory when you do something with them, like concatenating them into a big df?

Several things you can try:

  • Don't load all the data at once, but split it into pieces
  • As far as I know, hdf5 is able to do these chunks automatically and only loads the part your program is currently working on (see the sketch after this list)
  • Check whether the dtypes are ok; a string '0.111111' needs more memory than a float
  • Think about what you actually need: if the address is stored as a string, you might not need it for numerical analysis...
  • A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)
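A minimal sketch of the chunk-to-HDF5 idea from the list above (my addition, not the answer's own code; the file names, store key and chunk size are placeholders, and appending requires the PyTables package):

import pandas as pd

# stream the csv into an on-disk HDF5 table chunk by chunk, so only one
# chunk is ever held in memory at a time
chunks = pd.read_csv("big.csv", sep=",", skiprows=2, chunksize=100000)

with pd.HDFStore("store.h5") as store:
    for chunk in chunks:
        # keep appending to the same table; wide object columns may need min_itemsize
        store.append("data", chunk, index=False)

# later, read back only a slice instead of the whole table:
# df = pd.read_hdf("store.h5", "data", start=0, stop=100000)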
SebastianNeubauer
0

Add these: ratings = pd.read_csv(..., low_memory=False, memory_map=True)

My memory usage with these two: 319,082,496. Without these two: 349,110,272.

nikolaosmparoutis
-1

Although this is more of a workaround than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead - I've been writing and reading sizable JSON dataframes (100s of MB) in Pandas this way without any problem at all.
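A minimal sketch of that workaround (my addition, not the answerer's code; the file names are placeholders): convert the CSV to line-delimited JSON with the standard library, streaming row by row, then load it with read_json:

import csv
import json

import pandas as pd

with open("big.csv", newline="") as src, open("big.jsonl", "w") as dst:
    for row in csv.DictReader(src):          # streams one row at a time, so memory stays low
        dst.write(json.dumps(row) + "\n")

df = pd.read_json("big.jsonl", lines=True)   # lines=True reads JSON Lines format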

LetMeSOThat4U