peewee orm: bulk insert using a subquery but is based on python-side-data

Question

peewee allows bulk inserts via insert_many() and insert_from(), however insert_many() allows a list of data to be inserted, but does not allow data computed from other parts of the database. insert_from() does allow data computed from other parts of the database, but does not allow any data to be sent from python.

Example:

Assuming a model structure like so:

class BaseModel(Model):
    class Meta:
        database = db

class Person(BaseModel):
    name = CharField(max_length=100, unique=True)

class StatusUpdate(BaseModel):
    person = ForeignKeyField(Person, related_name='statuses')
    status = TextField()
    timestamp = DateTimeField(constraints=[SQL('DEFAULT CURRENT_TIMESTAMP')], index=True)

And some initial data:

Person.insert_many(rows=[{'name': 'Frank'}, {'name': 'Joe'}, {'name': 'Arnold'}]).execute()
print ('Person.select().count():',Person.select().count())

Output:

Person.select().count(): 3

Say we want to add a bunch new status updates, like the ones in this list:

new_status_updates = [ ('Frank', 'wat')
                     , ('Frank', 'nooo')
                     , ('Joe', 'noooo')
                     , ('Arnold', 'nooooo')]

We might try to use insert_many() like so:

StatusUpdate.insert_many( rows=[{'person': 'Frank', 'status': 'wat'}
                              , {'person': 'Frank', 'status': 'nooo'}
                              , {'person': 'Joe', 'status': 'noooo'}
                              , {'person': 'Arnold', 'status': 'nooooo'}]).execute()

But this would fail: the person field expects a Person model or a Person.id, and we would have to make an extra query to retrieve those from the names.

We might be able to avoid this with insert_from() allows us to make subqueries, but insert_from() has no way of processing our lists or dictionaries. What to do?

score 1 · Answer 1 · edited May 23 '17 at 11:46

One idea is to use the SQL VALUES clause as part of a SELECT statement.

If you are familiar with SQL, you may have seen the VALUES clause before, it is commonly used as part of an INSERT statement like so:

INSERT INTO statusupdate (person_id,status)
VALUES (1, 'my status'), (1, 'another status'), (2, 'his status');

This tells the database to insert three rows - AKA tuples - into the table statusupdate.

Another way of inserting things though is to do something like:

INSERT INTO statusupdate (person_id,status)
SELECT ..., ... FROM <elsewhere or subquery>;

This is equivalent to the insert_from() functionality that peewee provides.

But there is another less common thing you can do: you can use the VALUES clause in any select to provide literal values. Example:

SELECT *
FROM (VALUES (1,2,3), (4,5,6)) as my_literal_values;

This will return a result-set of two rows/tuples, each with 3 values.

So, if you can convert the "bulk" insert into a SELECT/FROM/VALUES statement, you can then do whatever transformations you need to do (namely, convert Person.name values to corresponding Person.id values) and then combine it with the peewee 'insert_from()` functionality.

So let us see how this would look.

First let us begin constructing the VALUES clause itself. We want properly escaped values, so we will use question marks instead of the values for now, and put the actual values in later.

#this is gonna look like '(?,?), (?,?), (?,?)'
# or '(%s,%s), (%s,%s), (%s,%s)' depending on the database type
values_question_marks = ','.join(['(%s, %s)' % (db.interpolation,db.interpolation)]*len(new_status_updates))

The next step is to construct the values clause. Here is our first attempt:

--the %s here will be replaced by the question marks of the clause
--in postgres, you must have a name for every item in `FROM`
SELECT * FROM (VALUES %s) someanonymousname

OK, so now we have a result-set that looks like:

name | status
-----|-------
...  | ...

Except! There are no column names. This will cause us a bit of heartache in a minute, so we have to figure out a way to give the result-set proper column names.

The postgres way would be to just alter the AS clause:

SELECT * FROM (VALUES %s) someanonymousname(name,status)

sqlite3 does not support that (grr).

So we are reduced to a kludge. Luckily stackoverflow provides: Is it possible to select sql server data using column ordinal position, and we can construct something like this:

SELECT NULL as name, NULL as status WHERE 1=0
UNION ALL
SELECT * FROM (VALUES %s) someanonymousname

This works by first creating an empty result-set with the proper column-names, and then concatenating the result-set from the VALUES clause to it. This will produce a result-set that has the proper column-names, will work in sqlite3, and in postgres.

Now to bring this back to peewee:

values_query = """
(
    --a trick to make an empty query result with two named columns, to more portably name the resulting
    --VALUES clause columns (grr sqlite)
    SELECT NULL as name, NULL as status WHERE 1=0
    UNION ALL
    SELECT * FROM (VALUES %s) someanonymousname
)
"""

values_query %= (values_question_marks,)

#unroll the parameters into one large list
#this is gonna look like ['Frank', 'wat', 'Frank', 'nooo', 'Joe', 'noooo' ...]
values_query_params = [value for values in new_status_updates for value in values]

#turn it into peewee SQL
values_query = SQL(values_query,*values_query_params)
data_query = (Person
                .select(Person.id, SQL('values_list.status').alias('status'))
                .from_(Person,values_query.alias('values_list'))
                .where(SQL('values_list.name') == Person.name))


insert_query = StatusUpdate.insert_from([StatusUpdate.person, StatusUpdate.status], data_query)

print (insert_query)
insert_query.execute()
print ('StatusUpdate.select().count():',StatusUpdate.select().count())

Output:

StatusUpdate.select().count(): 4

peewee orm: bulk insert using a subquery but is based on python-side-data

Example:

1 Answers1