I have a pandas DataFrame in the form of:
Time    Temperature    Voltage    Current
0.0     7.8            14         56
0.1     7.9            12         58
0.2     7.6            15         55
...and so on for a few hundred thousand rows...
I need to bulk insert the data into a PostgreSQL database as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and query building, but I'm open to suggestions if there are more efficient ways to accomplish the task.
My data model looks like this:
class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()
So Time is row[0] of the DataFrame, and then for each header column I grab the value that corresponds to it, using the header as parameter. So row[0] of the example table would generate 3 Data objects in my database:
Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)
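To make that concrete, the mapping is essentially a wide-to-long reshape. Sketched with pandas' melt purely for illustration (this is not what my parser currently does, just the shape I need to end up with):

import pandas as pd

# Example data matching the table above
df = pd.DataFrame({
    "Time": [0.0, 0.1, 0.2],
    "Temperature": [7.8, 7.9, 7.6],
    "Voltage": [14, 12, 15],
    "Current": [56, 58, 55],
})

# Wide-to-long reshape: one row per (time, parameter, value) triple
long_df = df.melt(id_vars=["Time"], var_name="parameter",
                  value_name="parameter_value")

# Each row of long_df corresponds to one Data object, e.g.
# Time=0.0, parameter="Temperature", parameter_value=7.8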
Our application lets users parse data files sampled at millisecond resolution, so we generate a LOT of individual data objects from a single file. My current task is to make the parser much more efficient, until we hit I/O constraints at the hardware level.
My current solution is to go through each row, create one Data object per time + parameter + value combination, and append it to an array so that I can Data.objects.bulk_create(all_data_objects) through Django. Of course, I'm aware that this is inefficient and could probably be improved a lot.
Using this code:
# Convert the DataFrame to a list of per-row dicts
df_records = df.to_dict('records')

# Start with an empty data array
all_data_objects = []

# Go through each row, creating objects and appending them to the data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                time=row["Time"],
                parameter_value=parameter_value,
                parameter=parameter))

# Commit the data to the Postgres DB
Data.objects.bulk_create(all_data_objects)
Currently the entire operation, excluding the DB insert itself (the write to disk), i.e. just generating the array of Data objects, takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured using time.time() at both ends of each section and taking the difference.
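For reference, each section was timed roughly like this (a sketch; the comments are placeholders for the real code):

import time

start = time.time()
df_records = df.to_dict('records')
print("to_dict:", time.time() - start)           # ~83 seconds

start = time.time()
# ... the object-building loop from above ...
print("object building:", time.time() - start)   # rest of the ~370 seconds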
How can I improve these times?