
I download a bunch of CSV files from an AWS S3 bucket and put them in a DataFrame. Before uploading the DataFrame to SQL Server, I would like to change the columns of the DataFrame to the right datatypes.

When I run astype('float64') on a column I want to convert, it changes not only the datatype but also the data.

Code:

df['testcol'] = df['lineId'].astype('float64')

[screenshot: PyCharm view of the result]

I attached a picture to visualize the error. As you can see, the data in the third column (testcol) differs from the data in the second column (lineId), even though only the datatype should have changed.

A pl_id can have multiple lineIds, which is why I added pl_id and sorted by it in the picture.

Am I using astype() wrong or is this a pandas bug?

– kiesel
1 Answer


Basically, it seems that float64 is not precise enough to hold an integer that long:

np.float64('211052094743748628')
Out[135]: 2.1105209474374864e+17

"The max precision a float 64 can reach is close to 10-16 (unit in the last place (ULP), see en.wikipedia.org/wiki/Floating-point_arithmetic) so the idea of an exact decimal value with significantly more than 16 digits for a floating point is misleading." Numpy float64 vs Python float

Consider using int64 instead, which is better suited to the size of the IDs in your dataset:

np.int64('211052094743748628')
Out[150]: 211052094743748628
– RunTheGauntlet
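
For illustration, a minimal sketch with made-up IDs, assuming the lineId column still holds the strings as read from the CSV (object dtype):

import pandas as pd

# hypothetical sample: two 18-digit IDs as they arrive from the CSV
df = pd.DataFrame({'lineId': ['211052094743748628', '211052094743748629']})

df['as_float'] = df['lineId'].astype('float64')  # both rows round to 2.1105209474374864e+17
df['as_int'] = df['lineId'].astype('int64')      # exact: int64 holds integers up to about 9.2e18

print(df)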
  • `int`, more specifically, `Int64` is the datatype I am actually trying to assign to this column, but [this pandas bug](https://github.com/pandas-dev/pandas/issues/25472) says I need to first convert the column to `float`. Any other datatype I could use? – kiesel Jun 24 '20 at 13:43
  • I just tried `df['testcol5']=df['lineId'].apply(int)` but the problem persists. – kiesel Jun 24 '20 at 14:21
  • How about astype('int64')? – RunTheGauntlet Jun 25 '20 at 07:46
  • That doesn't fail, but it produces the same wrong output as `float64` does. – kiesel Jun 25 '20 at 09:06
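
Regarding the Int64 discussion in the comments: if the linked bug prevents a direct astype('Int64') from strings, a possible workaround (a hedged sketch, assuming the column has no missing values and still holds strings) is to go through int64 rather than float64, so no precision is lost:

import pandas as pd

# hypothetical IDs as read from the CSV (object dtype, no missing values)
s = pd.Series(['211052094743748628', '211052094743748629'])

# object -> int64 -> nullable Int64; skipping the float64 step keeps the IDs exact
s_nullable = s.astype('int64').astype('Int64')

print(s_nullable.dtype)  # Int64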