
I am trying to import CSV files into a pandas DataFrame and write this DataFrame to a PostgreSQL database. However, after writing the DataFrame to the SQL database, rows appear in the resulting table that are not in the DataFrame.

The original data files contain some measurement errors that are stored as the string 'None', which I replace in the DataFrame with NaN values through:

    modes['wspd [m/s]'].replace('None', np.NaN, inplace=True)
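
For context, here is a minimal, self-contained version of that step; the `modes` frame and its column name are from my script, but the sample values below are made up:

    import numpy as np
    import pandas as pd

    # Toy frame where the string 'None' stands in for a failed measurement
    modes = pd.DataFrame({'wspd [m/s]': ['3.2', 'None', '4.1']})

    # Replace the literal string 'None' with a real NaN, in place
    modes['wspd [m/s]'].replace('None', np.NaN, inplace=True)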

Then, I write the Dataframe to the SQL database through:

    result.to_sql(sql_table, schema='earth', con=engine, index=False, if_exists='replace', chunksize=10000)
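
For completeness, the surrounding setup looks roughly like this; the connection URL below is a placeholder, not my actual credentials:

    from sqlalchemy import create_engine

    # Placeholder PostgreSQL connection string; substitute real credentials
    engine = create_engine('postgresql://user:password@localhost:5432/mydb')

    result.to_sql(sql_table, schema='earth', con=engine,
                  index=False, if_exists='replace', chunksize=10000)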

I use a chunksize of 10,000 because the total DataFrame I am trying to write contains 13,000,000 rows. Here comes the weird part. As an example, rows 60-68 in the DataFrame I want to write to the SQL database look like:

date_time                 ...        cross_wnd [kt]                                       
2017-03-07T00:10:00       ...             -7.910255
2017-03-07T00:10:10       ...             -7.119229
2017-03-07T00:10:20       ...             -7.119229
2017-03-07T00:10:30       ...             -7.910255
2017-03-07T00:10:40       ...             -7.910255
2017-03-07T00:10:50       ...             -7.910255
2017-03-07T00:11:00       ...             -7.910255
2017-03-07T00:11:10       ...             -7.910255

But in the final SQL database, the same rows 60-68 contain a new row from a different date (and thus a different CSV file), one that had its None value replaced with NaN:

    date_time                 ...        cross_wnd [kt]
60  2017-03-07T00:10:00       ...             -7.910255
61  2017-03-07T00:10:10       ...             -7.119229
62  2017-03-07T00:10:20       ...             -7.119229
63  2017-03-07T00:10:30       ...             -7.910255
64  2017-03-07T00:10:40       ...             -7.910255
65  2017-03-25T10:54:30       ...                   NaN
66  2017-03-07T00:10:50       ...             -7.910255
67  2017-03-07T00:11:00       ...             -7.910255

This strange behaviour occurs every 66 rows, and only for the 8 rows in the CSV file of 2017-03-25 that contained None values and were replaced. The rest of the files are processed correctly: the None values are replaced without any trouble, and the rest of the DataFrame is stored in the SQL database as expected.

Does anyone have an idea how this can happen and what I can do to solve it? I could simply remove these 8 lines from the final SQL database, but I would like to use the Python script for more data files. Could it be due to the inplace=True parameter in the replace function? Or due to the large number of rows not being written correctly to the database?
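
As a sanity check (an untested sketch), I could verify whether the stray timestamp already exists in the DataFrame before the write, and compare row counts after reading the table back:

    # Does the stray 2017-03-25 row exist in the DataFrame before the write?
    # (Assuming date_time is a plain string column; adjust if it is datetime.)
    print(result[result['date_time'] == '2017-03-25T10:54:30'])

    # Read the table back and compare its size with the source DataFrame
    check = pd.read_sql_table(sql_table, con=engine, schema='earth')
    print(len(result), len(check))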
