550
df = pd.read_csv('somefile.csv')

...gives an error:

.../site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why might low_memory=False help?

Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
Josh
  • 11,979
  • 17
  • 60
  • 96
  • 9
    I have a question about this warning. Is the index of the columns mentioned 0-based? For example column 4 which has a mixed type, is that df[:,4] or df[:,3] – maziar Mar 21 '16 at 19:15
  • 6
    @maziar when reading a csv, by default a new 0-based index is created and used. – firelynx May 24 '17 at 14:19

13 Answers

700

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of a file which has a column called user_id. It contains 10 million rows where the user_id is always a number. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
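
A minimal sketch of that inference behaviour, using a small in-memory CSV instead of a 10-million-row file: the very same column comes out as int64 or object depending on a single value near the end.

import pandas as pd
from io import StringIO

clean = "user_id\n1\n2\n3\n"
dirty = "user_id\n1\n2\nfoobar\n"

# Without a dtype, pandas has to look at every value before deciding.
print(pd.read_csv(StringIO(clean))["user_id"].dtype)  # int64
print(pd.read_csv(StringIO(dirty))["user_id"].dtype)  # object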

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.
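
In a full call that would look something like this (a sketch: the file name is the one from the question, and user_id is the hypothetical column from above):

import pandas as pd

# Declare the column type up front, so pandas never has to guess it
# (and never emits a DtypeWarning for it).
df = pd.read_csv('somefile.csv', dtype={'user_id': int})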

Also worth noting is that if the last line in the file had "foobar" written in the user_id column, loading would crash if the above dtype were specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exists?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

'datetime64[ns, <tz>]', which is a time zone aware timestamp.

'category', which is essentially an enum (strings represented by integer keys, to save space).

'period[<freq>]'. Not to be confused with a timedelta, these objects are actually anchored to specific time periods.

'Sparse', 'Sparse[int]' and 'Sparse[float]' are for sparse data, or 'data that has a lot of holes in it'. Instead of saving the NaN or None in the dataframe it omits the objects, saving space.

'Interval' is a topic of its own but its main use is for indexing. See more here

'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas specific integers that are nullable, unlike the numpy variant.

'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.

'boolean' is like the numpy 'bool' but it also supports missing data.
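
A short sketch of how some of these extension dtypes can be requested straight from read_csv. This assumes pandas 1.0 or newer (where 'string' and 'boolean' were introduced), and the data is made up for illustration:

import pandas as pd
from io import StringIO

csvdata = """user_id,username,active
1,Alice,True
2,Bob,
,Caesar,False"""

df = pd.read_csv(StringIO(csvdata), dtype={
    "user_id": "Int64",    # nullable integer; the missing value becomes <NA>
    "username": "string",  # string extension dtype, gives access to .str
    "active": "boolean",   # nullable boolean
})
print(df.dtypes)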

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but it will not make the load more memory efficient; if anything it is only more process efficient.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
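
A small sketch of that second point: with either dtype=object or dtype='unicode', every column simply comes back as object.

import pandas as pd
from io import StringIO

csvdata = "user_id,username\n1,Alice\n2,Bob"

# No guessing is done, so no DtypeWarning, but also no memory savings:
# both columns end up as plain object columns holding strings.
print(pd.read_csv(StringIO(csvdata), dtype=object).dtypes)
print(pd.read_csv(StringIO(csvdata), dtype='unicode').dtypes)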

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the whole read_csv call runs in a single process.

CSV files can be processed line by line, so they could be converted in parallel far more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.

John S
  • 85
  • 10
firelynx
  • 30,616
  • 9
  • 91
  • 101
  • 9
    So, given that setting a `dtype=object` is not more memory efficient, is there any reason to mess with it besides getting rid of the error? – zthomas.nc Aug 31 '16 at 07:09
  • 7
    @zthomas.nc yes, Pandas does not need to bother testing what is in the column. Theoretically saving some memory while loading (but none after loading is complete) and theoretically saving some cpu cycles (which you won't notice since disk I/O will be the bottleneck). – firelynx Sep 01 '16 at 11:22
  • 5
    "Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified." is there some "coerce" option that could be used to throw away this row instead of crashing? – sparrow Sep 01 '16 at 15:33
  • 5
    @sparrow there may be, but last time I used it it had bugs. It may be fixed in the latest version of pandas. `error_bad_lines=False, warn_bad_lines=True` should do the trick. The documentation says it's only valid with the C parser. It also says the default parser is None which makes it hard to know which one is the default. – firelynx Sep 02 '16 at 06:48
  • "cutting the file into segments and running multiple processes, something that python does not support." The preceding statement is categorically false. – Geoffrey Anderson Nov 20 '16 at 00:26
  • @GeoffreyAnderson s/python/pandas/, or do you care to elaborate on some other falsehood? – firelynx Nov 24 '16 at 08:07
  • 1
    Is there some convenient way to produce the right dtype setting based on a small sample? I'd be fine with just having to do it all carefully by hand if the initial say 100 or 1000 rows weren't consistent with the rest of the file. But manually typing in dicts for knock-off analysis tasks is painful. – nealmcb Dec 17 '16 at 00:47
  • 9
    @nealmcb You can read the dataframe with `nrows=100` as an argument and then do `df.dtypes` to see the dtypes you get. However, when reading the whole dataframe with these dtypes, be sure to do a `try/except` so you catch faulty dtype guesses. Data is dirty you know. – firelynx Dec 19 '16 at 08:17
  • Re @GeoffreyAnderson comment: I guess he was objecting to the "python" typo instead of "pandas" (which has since been corrected), but even in pandas there is the `chunksize` iterator which lets you divide the read into sections (I'm not sure about any multiprocessing capability here, but in the interest of memory management `chunksize` can still be helpful). – JohnE Jun 06 '17 at 16:08
  • 2
    Theoretically I understand it's definitely better to specify all col_types ahead of time, but is there some pragmatic solution for when you have thousands of columns? For instance I want to only specify the dtypes of the columns that throw mixed type warnings – 3pitt Jan 11 '18 at 14:34
  • 1
    I'd only change "always" to "almost always". I have 24k columns with mixed dtypes, which would be one messy looking ```dtype={}``` specification. It may be inefficient, but it still only takes 10 seconds to read in, which is totally acceptable to me. – James Paul Mason Jun 08 '18 at 14:28
  • 1
    @JamesPaulMason If you load it once, you can do `df.dtypes.to_csv('dtypes.csv')` and save the dtypes for future use, easy. But yeah, if it isn't broken, I'm not gonna bash you for not fixing it. – firelynx Jun 08 '18 at 14:42
  • Nice! Didn't even think of doing that! Pretty clever. If things explode further and it starts taking > 20 seconds to load then I'll implement this – James Paul Mason Jun 08 '18 at 18:35
  • @firelynx **Specifying dtypes**: what dtypes are accepted by this argument to the call? – CheTesta Jul 26 '18 at 09:11
  • 1
    @CheTesta Added dtype reference in the answer – firelynx Jul 26 '18 at 18:13
  • @firelynx Is there a better alternative for dtype=object in terms of memory usage? – Ruthger Righart Feb 14 '19 at 14:49
  • 1
    @RuthgerRighart If your data consists of random strings, then no. Numbers should be int or float, dates should be parsed as dates. If you have strings that aren't random you can use categorical. Nothing is worse than object, but some data can't be anything else. – firelynx Feb 21 '19 at 09:39
  • 1
    is there also a string datatype for pandas now? – baxx May 20 '20 at 22:35
  • @baxx yes, since pandas 1.0.0 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.arrays.StringArray.html#pandas.arrays.StringArray – firelynx May 22 '20 at 06:24
  • Just in case it helps someone... I went round and round and round with this trying all sorts of things with nothing working. Then it dawned on me: the column names in my CSV were upper case, but I had them lower case in my python script. Might seem obvious, but SQL is my native language, and in most cases there, column name case does not matter. But bottom line, adding a dtype={} solved the problem for me. – John Chase Dec 03 '20 at 14:56
  • I believe `low_memory=False` is different from `low_memory=True`. `low_memory=True` may result in multiple types within the same column (called mixed types) and `low_memory=False` will read the data twice, inferring the types as discussed in this answer. It would be good to accurately describe how the two options behave. – Josiah Yoder Jun 28 '22 at 17:03
  • For large datasets, setting `dtype='str'` and then manually parsing any columns that require it to numeric types after the fact often works reasonably well. Of course, explicitly naming the columns and types expected in the file can help future maintainers. – Josiah Yoder Jun 28 '22 at 17:05
  • imagine your df has 100s of columns, you wouldn't want to individually set the data type for each, therefore you need pandas to guess. Is there a way of guessing the first n rows in a df? So could set this number quite high. – Theo F Aug 01 '23 at 11:49
  • @TheoF I imagine that if you have a large number of columns you have a large surface area for things to go wrong and you are doing yourself a favour by going through the dtypes. Maybe all your dtypes are the same and you know it, in that case you should be able to optimise your work a bit. But generally, I think it's time well invested. – firelynx Aug 02 '23 at 17:06
  • @firelynx ok let me re-phrase, imagine a scenario where you have a csv file come in weekly which has 100s of columns, but each week there may be changes in either a) the number of columns, or b) the data types within these columns. I agree it's robust to go through and set dtypes, but for a scenario I describe, you'd need pandas to infer/guess each week. – Theo F Aug 03 '23 at 11:41
  • 1
    @TheoF If the column names gives you a hint, you can read them, make some assumptions on certain columns and include them in the dtype specification when you read the dataset. Apart from that, I would let pandas guess, but then set up some hard validation rules on the dataset just to make sure the dataframe looks ok. There should be some rules to your data that you can check. Also, test your code with weird datasets. Generate some ghastly datasets to make sure your code is robust. Pandas has the ability to go completely haywire without giving actual errors if fed data of the wrong type/format – firelynx Aug 04 '23 at 06:41
76

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.

hd1
  • 33,938
  • 5
  • 80
  • 91
  • 1
    Adding `dtype=unicode` produced: `NameError: name 'unicode' is not defined`. But putting `unicode` in quotes (as in 'unicode') appears to work! – sedeh Feb 19 '15 at 18:06
  • 5
    @sedeh You can specify dtypes either as python types or as `numpy.dtype('unicode')`. When you give the dtype option a string, it will try to cast it via the `numpy.dtype()` factory by default. Specifying `'unicode'` will actually not do anything, unicodes are just upcasted to `objects`. You will get `dtype='object'` – firelynx Jul 15 '15 at 07:35
62
df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.

firelynx
  • 30,616
  • 9
  • 91
  • 101
Neal
  • 709
  • 5
  • 2
  • 88
    This silences the error, but does not actually change anything else. – firelynx Jan 13 '16 at 08:32
  • 3
    I have same problem while running 1.5gb datafile – Sitz Blogz May 25 '17 at 09:14
  • show this error when i tried , C error: out of memory – vampirekabir Feb 03 '21 at 09:06
  • 1
    what is low_memory = False doing exactly ? Is it solving the issue or just not showing the error message? – JSVJ Mar 22 '21 at 06:35
  • 1
    @JSVJ I think setting low_memory = False solves the problem now (see my answer). It seems there was a time when it was going to be deprecated, but that didn't happen. – Richard DiSalvo Dec 08 '21 at 01:44
  • Just be careful, doing this may result in pandas guessing different dtypes than with low_memory=True. It may silence the error but double check for potential silent bugs. – Pab Oct 10 '22 at 22:24
27

This worked for me!

file = pd.read_csv('example.csv', engine='python')
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Jerald Achaibar
  • 407
  • 5
  • 9
23

As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

import numpy as np
import pandas as pd

def conv(val):
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
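
Applied to the broken user_id data from the earlier answer, this kind of converter loads the file instead of crashing; the 'foobar' row simply becomes 0 (a sketch reusing the conv function above):

import numpy as np
import pandas as pd
from io import StringIO

def conv(val):
    # same converter as above
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

df = pd.read_csv(StringIO(csvdata), converters={'user_id': conv})
print(df)         # the foobar row survives, with user_id 0.0
print(df.dtypes)  # user_id is float64, username is object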
sparrow
  • 10,794
  • 12
  • 54
  • 74
7

I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:

  1. the file contained strange characters (fixed using encoding)
  2. the datatype was not specified (fixed using dtype property)
  3. Using the above I still faced an issue which was related to the file_format that could not be determined based on the filename (fixed using try .. except..)
    from pathlib import Path
    import pandas as pd

    df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                     names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                     dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})
    
    try:
        df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
    except:
        df['file_format'] = ''
technomage
  • 9,861
  • 2
  • 26
  • 40
wfolkerts
  • 99
  • 1
  • 4
6

Importing the DataFrame with low_memory=False worked for me. That is the only change I made:

df = pd.read_csv('export4_16.csv',low_memory=False)
Paul Roub
  • 36,322
  • 27
  • 84
  • 93
Rajat Saxena
  • 93
  • 1
  • 2
  • This answer is the same answer as [below](https://stackoverflow.com/a/33161955/1983957) and just silences the error but does not change anything else as pointed out by firelynx – Greg Hilston Aug 09 '21 at 12:03
5

According to the pandas documentation, specifying low_memory=False (as long as engine='c', which is the default) is a reasonable solution to this problem.

If low_memory=False, then whole columns will be read in first, and then the proper types determined. For example, the column will be kept as objects (strings) as needed to preserve information.

If low_memory=True (the default), then pandas reads in the data in chunks of rows, then appends them together. Then some of the columns might look like chunks of integers and strings mixed up, depending on whether during the chunk pandas encountered anything that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once in the read in, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.

Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones and so convenience is more important than efficiency.

The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds on the answer by @firelynx.

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""

# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads 
# the whole thing in one chunk, because it's faster, and we don't get any 
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
csvdatafull = ""
for i in range(100000):
    csvdatafull = csvdatafull + csvdata
csvdatafull =  csvdatafull + "foobar,Cthlulu\n"
csvdatafull = "user_id,username\n" + csvdatafull

sio = StringIO(csvdatafull)
# the following line gives me the warning:
    # C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
    # interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True) #, dtype={"user_id": int, "username": "string"})

x.dtypes
# this gives:
# Out[69]: 
# user_id     object
# username    object
# dtype: object

type(x['user_id'].iloc[0]) # int
type(x['user_id'].iloc[1]) # int
type(x['user_id'].iloc[2]) # int
type(x['user_id'].iloc[10000]) # int
type(x['user_id'].iloc[299999]) # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000]) # str !!!!!

Aside: To give an example where this is a problem (and where I first encountered this as a serious issue), imagine you ran pd.read_csv() on a file then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric, sometimes string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.

With low_memory=True, pandas might read in the identifier column like this:

81287
81287
81287
81287
81287
"81287"
"81287"
"81287"
"81287"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
"97324-32"

This happens just because of the chunking: sometimes the identifier 81287 is a number, sometimes a string. When I try to drop duplicates based on this, well,

81287 == "81287"
Out[98]: False
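
If the identifiers should all be compared as strings, one way around this (a sketch; 'data.csv' and 'identifier' are placeholder names) is to pin the column's dtype on read so every chunk agrees, and only then deduplicate:

import pandas as pd

# Hypothetical file and column names: force the identifier column to str so
# 81287 is always the string "81287", regardless of which chunk read it.
df = pd.read_csv('data.csv', dtype={'identifier': str})
df = df.drop_duplicates(subset='identifier')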
Richard DiSalvo
  • 850
  • 12
  • 16
5

As the warning says, you should specify the datatypes when using the read_csv() method. So, you should write

file = pd.read_csv('example.csv', dtype='unicode')
David Buck
  • 3,752
  • 35
  • 31
  • 35
Mahmoud Ragab
  • 81
  • 1
  • 4
4

Sometimes, when all else fails, you just want to tell pandas to shut up about it:

import warnings

# Ignore DtypeWarnings from pandas' read_csv
warnings.filterwarnings('ignore', message="^Columns.*")
technomage
  • 9,861
  • 2
  • 26
  • 40
2

I had a similar issue with a ~400 MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc...

Dadep
  • 2,796
  • 5
  • 27
  • 40
Dr Nigel
  • 29
  • 1
2

Building on the answer given by Jerald Achaibar, we can detect the mixed dtypes warning and only use the slower python engine when the warning occurs:

import pandas
import warnings

# Force mixed datatype warning to be a python error so we can catch it and reattempt the
# load using the slower python engine
warnings.simplefilter('error', pandas.errors.DtypeWarning)
try:
    df = pandas.read_csv(path, sep=sep, encoding=encoding)
except pandas.errors.DtypeWarning:
    df = pandas.read_csv(path, sep=sep, encoding=encoding, engine="python")
Iain Hunter
  • 4,319
  • 1
  • 27
  • 13
1

This worked for me!

dashboard_df = pd.read_csv(p_file, sep=';', error_bad_lines=False, index_col=False, dtype='unicode')
halfelf
  • 9,737
  • 13
  • 54
  • 63