I have a pandas DataFrame in the form of:
Time    Temperature    Voltage    Current
0.0     7.8            14         56
0.1     7.9            12         58
0.2     7.6            15         55
...and so on for a few hundred thousand rows...
I need to bulk insert the data into a PostgreSQL database as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and query building, but I'm open to suggestions if there are more efficient ways to accomplish the task.
My data model looks like this:
class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()
So Time is row[0] of the DataFrame, and then for each header column I grab the value that corresponds to it, using the header as parameter. So row[0] of the example table would generate 3 Data objects in my database:
Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)
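To make that concrete, the mapping is essentially a wide-to-long reshape. Sketched with pandas' melt purely for illustration (this is not what my parser currently does, just the shape I need to end up with):

import pandas as pd

# Example data matching the table above
df = pd.DataFrame({
    "Time": [0.0, 0.1, 0.2],
    "Temperature": [7.8, 7.9, 7.6],
    "Voltage": [14, 12, 15],
    "Current": [56, 58, 55],
})

# Wide-to-long reshape: one row per (time, parameter, value) triple
long_df = df.melt(id_vars=["Time"], var_name="parameter",
                  value_name="parameter_value")

# Each row of long_df corresponds to one Data object, e.g.
# Time=0.0, parameter="Temperature", parameter_value=7.8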
Our application lets users parse data files sampled at millisecond resolution, so we generate a LOT of individual data objects from a single file. My current task is to make the parser much more efficient, until we hit I/O constraints at the hardware level.
My current solution is to go through each row, create one Data object per time + parameter + value combination, and append it to an array so that I can Data.objects.bulk_create(all_data_objects) through Django. Of course, I'm aware that this is inefficient and could probably be improved a lot.
Using this code:
# Convert the DataFrame to a list of per-row dicts
df_records = df.to_dict('records')

# Start with an empty data array
all_data_objects = []

# Go through each row, creating objects and appending them to the data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                time=row["Time"],
                parameter_value=parameter_value,
                parameter=parameter))

# Commit the data to the Postgres DB
Data.objects.bulk_create(all_data_objects)
Currently the entire operation, excluding the DB insert itself (the write to disk), i.e. just generating the array of Data objects, takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured using time.time() at both ends of each section and taking the difference.
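For reference, each section was timed roughly like this (a sketch; the comments are placeholders for the real code):

import time

start = time.time()
df_records = df.to_dict('records')
print("to_dict:", time.time() - start)           # ~83 seconds

start = time.time()
# ... the object-building loop from above ...
print("object building:", time.time() - start)   # rest of the ~370 seconds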
How can I improve these times?