  • I am working with a very large dataset that won't fit in my RAM (16 GB).
  • I noticed that the column dtypes are all float64, but the values in the first 10k rows only range from -1.0 to +1.0.
  • Checking the full dataset would take too much time.

I want to specify the dtype in the read_csv for all columns to float16 to reduce the necessary memory:

import pandas as pd

types = {}

# Map every column name (taken from a one-row read of the file) to float16.
for column in only_first_row_dataframe.columns:
    types[column] = 'float16'

...

dataframe = pd.read_csv(path, engine="c", dtype=types, low_memory=False)

After running the above code, would I be notified that some values didn't fit into the 16-bit float, and that therefore some data was lost?


  • I am asking this question because I only tested whether the first 10k rows fit into the range (-1.0, +1.0).
  • So I want to be sure I won't lose any data.
  • When I run the code I don't get any warnings and the dataset is loaded into my RAM, but I am not certain whether any data was lost.
  • According to this answer, I will be notified if there is a major dtype error, for example if column A has a string value at the end but I specified the dtype as int (a quick sketch of that behavior follows below). But there is no mention of the problem I am asking about here.
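
For reference, here is a minimal sketch of that "major error" case; the sample data is made up, but specifying an integer dtype for a column that contains a string makes pandas raise an error rather than silently convert:

from io import StringIO
import pandas as pd

s = """col
1
2
abc
"""

# Specifying int for a column that ends with a string raises an error
# (a ValueError; the exact message depends on the pandas version).
pd.read_csv(StringIO(s), dtype={"col": "int64"})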

1 Answer


As you mentioned, an error will be raised if you have a major dtype error (for example specifying int64 when the column actually contains float64 values). However, you won't get an error if, for example, you use int8 instead of int16 and the range of your values does not fit into the range of int8 (i.e. -128 to 127).

Here is a quick example:

from io import StringIO
import pandas as pd

s = """col1|col2
150|1.5
10|2.5
3|2.5
4|1.2
8|7.5
"""

pd.read_csv(StringIO(s),sep='|', dtype={"col1": "int8"})

And the output is:


   col1  col2
0  -106   1.5
1    10   2.5
2     3   2.5
3     4   1.2
4     8   7.5

So as you can see, the first value in column col1 was silently converted from 150 to -106, without any error or warning from pandas.

The same applies to float types; I just used int for convenience.

EDIT: I added an example with floats, since that is what you were asking about:

from io import StringIO
import pandas as pd

s = """col1|col2
150|32890
10|2.5
3|2.5
4|1.2
8|7.5
"""

If you read it without specifying the dtype:

pd.read_csv(StringIO(s), sep='|')

   col1     col2
0   150  32890.0
1    10      2.5
2     3      2.5
3     4      1.2
4     8      7.5

If you read it with specifying the "wrong" dtype for the columns:

pd.read_csv(StringIO(s),sep='|', dtype={"col1": "int8", "col2": "float16"})

   col1          col2
0  -106  32896.000000
1    10      2.500000
2     3      2.500000
3     4      1.200195
4     8      7.500000
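
If you want to check whether a float16 load silently changed anything, one option (not from the original answer, just a sketch reusing the same sample data) is to read a sample at full precision, round-trip it through float16, and compare:

from io import StringIO
import pandas as pd

s = """col1|col2
150|32890
10|2.5
3|2.5
4|1.2
8|7.5
"""

# Read at full precision, round-trip through float16, and measure the drift.
sample = pd.read_csv(StringIO(s), sep='|', dtype="float64")
roundtrip = sample.astype("float16").astype("float64")
print((sample - roundtrip).abs().max())
# Expected output (col2's 32890 is stored as 32896 in float16):
# col1    0.0
# col2    6.0
# dtype: float64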

If you have a large CSV file and you want to optimize the dtypes, you can load the CSV file column by column (this should not take too much memory) with no dtype specified, infer the optimal dtype from the values inside each column, and then load the full CSV with the optimized dtypes.
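
A minimal sketch of that column-by-column approach, assuming the file lives at a hypothetical path data.csv; note that pd.to_numeric only downcasts floats as far as float32, so a float16 choice would still need an explicit check against your value range:

import pandas as pd

path = "data.csv"  # hypothetical path

# Read only the header row to get the column names without loading any data.
columns = pd.read_csv(path, nrows=0).columns

optimized = {}
for col in columns:
    # One column of the file at a time should fit comfortably in memory.
    series = pd.read_csv(path, usecols=[col]).squeeze("columns")
    if pd.api.types.is_float_dtype(series):
        # Smallest float dtype that holds the values (float32 at the smallest).
        optimized[col] = pd.to_numeric(series, downcast="float").dtype
    elif pd.api.types.is_integer_dtype(series):
        optimized[col] = pd.to_numeric(series, downcast="integer").dtype
    else:
        optimized[col] = series.dtype

# Reload the whole file with the per-column optimized dtypes.
dataframe = pd.read_csv(path, dtype=optimized)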