I am having difficulty translating a query I created in sql (postgres) into sqlalchemy. In particular, my attempted mapping in sqlalchemy is leading to absurd recursive results that will run far slower than what I originally wrote.
Given the following type of table structure:
metadata
------------------------------
primary_id - integer
secondary_count - integer
property - string (many to each primary_id)
data
-----------------------------
primary_id - integer
secondary_id - integer (many to each primary_id)
primary_json - json bytes
secondary_json - json bytes
I am trying to retrieve pairs of primary and secondary data with the such that:
- we match a given property
- we return only "some" of the primary data (say 1000)
- we return the "best" primary data, which is to say the primary data with the most secondary data.
- we only get "some" (say 10) secondary data per primary entry
The first is easy to accomplish with a join between the two tables, however the second is more complicated. The solution I cam to (see here for the explanation that led me to this solution) in raw SQL is:
SELECT primary_id, primary_json, secondary_json, secondary_count
FROM
(
SELECT primary_id, secondary_count
FROM metadata
WHERE property='whatever I want'
-- Get the "best" 1000 results
ORDER BY secondary_count DESC
LIMIT 1000
) my_primary_ids
LEFT OUTER JOIN LATERAL
(
SELECT primary_json, seondary_json
FROM data
WHERE primary_id = my_primary_ids.primary_id
-- Only return 10 pieces of secondary json per primary json
LIMIT 10
) json_content ON true;
I have tried my utmost to convert this to sqlalchemy, however I keep having the problem that the resulting query re-writes the subquery in the FROM
clause of the lateral join query.
For example, the sqlalchemy code (assuming table object definitions which match the above) below is a partial solution. I think I can add the columns that are lacking (as you will see in the sql that is generated):
from sqlalchemy import true
my_prim_ids_al = (
query(Metadata.primary_id.label('primary_id'),
Metadata.secondary_count.label('secondary_count'))
.filter_by(property='whatever I want')
.order_by(Metadata.secondary_count)
.limit(1000)
.from_self()
.subquery('my_primary_ids')
)
json_content_al = (
query(Data.primary_json.label('primary_json'),
Data.secondary_json.label('secondary_json'))
.filter_by(primary_id=my_primary_ids_al.c.primary_id)
.limit(10)
.from_self()
.subquery('json_content')
.lateral()
)
joined_query = (
my_primary_ids_al
.outerjoin(json_content_al, true())
.subquery('joined_query')
)
The joined query, in long form, is the following, with the aforementioned ridiculous nested structure:
SELECT anon_1.primary_id, anon_1.secondary_count
FROM
(
SELECT metadata.primary_id AS primary_id,
metadata.secondary_count AS secondary_count
FROM metadata
WHERE metadata.property = 'whatever I want'
ORDER BY metadata.secondary_count DESC
LIMIT :param_1
) AS anon_1
LEFT OUTER JOIN LATERAL
(
SELECT anon_4.anon_3_secondary_json AS anon_3_secondary_json,
anon_4.anon_3_primary_json AS anon_3_primary_json,
FROM
(
SELECT anon_3.secondary_json AS anon_3_secondary_json,
anon_3.primary_json AS anon_3_primary_json,
FROM
(
SELECT data.secondary_json AS secondary_json,
data.primary_json AS primary_json,
FROM data
JOIN
(
SELECT anon_1.primary_id AS primary_id,
anon_1.secondary_count AS secondary_count
FROM
(
SELECT metadata.primary_id AS primary_id,
metadata.secondary_count AS secondary_count
FROM metadata
WHERE metadata.property = 'whatever I want'
ORDER BY metadata.secondary_count DESC
LIMIT :param_1
) AS anon_1
) AS primary_ids ON data.primary_id = primary_ides.primary_id
) AS anon_3
LIMIT :param_2) AS anon_4) AS anon_2 ON true
Again, I realize this is an incomplete attempt, in that not all the columns are SELECTed in the beginning, but the key issue is the sqlalchemy is creating an absurd amount of nested queries in the lateral join sub-query. This is the core issue that I have not been able to resolve, and unless it is resolved, there is little point in finishing the rest of the query.