0

I am having difficulty translating a query I created in sql (postgres) into sqlalchemy. In particular, my attempted mapping in sqlalchemy is leading to absurd recursive results that will run far slower than what I originally wrote.

Given the following type of table structure:

metadata
------------------------------
primary_id      - integer
secondary_count - integer
property        - string  (many to each primary_id)

data
-----------------------------
primary_id      - integer
secondary_id    - integer (many to each primary_id)
primary_json    - json bytes
secondary_json  - json bytes

I am trying to retrieve pairs of primary and secondary data with the such that:

  1. we match a given property
  2. we return only "some" of the primary data (say 1000)
  3. we return the "best" primary data, which is to say the primary data with the most secondary data.
  4. we only get "some" (say 10) secondary data per primary entry

The first is easy to accomplish with a join between the two tables, however the second is more complicated. The solution I cam to (see here for the explanation that led me to this solution) in raw SQL is:

SELECT primary_id, primary_json, secondary_json, secondary_count
FROM
  (
    SELECT primary_id, secondary_count
    FROM metadata
    WHERE property='whatever I want'
    -- Get the "best" 1000 results
    ORDER BY secondary_count DESC
    LIMIT 1000
  ) my_primary_ids
 LEFT OUTER JOIN LATERAL
  (
    SELECT primary_json, seondary_json
    FROM data
    WHERE primary_id = my_primary_ids.primary_id
    -- Only return 10 pieces of secondary json per primary json
    LIMIT 10
  ) json_content ON true;

I have tried my utmost to convert this to sqlalchemy, however I keep having the problem that the resulting query re-writes the subquery in the FROM clause of the lateral join query.

For example, the sqlalchemy code (assuming table object definitions which match the above) below is a partial solution. I think I can add the columns that are lacking (as you will see in the sql that is generated):

from sqlalchemy import true

my_prim_ids_al = (
    query(Metadata.primary_id.label('primary_id'), 
          Metadata.secondary_count.label('secondary_count'))
    .filter_by(property='whatever I want')
    .order_by(Metadata.secondary_count)
    .limit(1000)
    .from_self()
    .subquery('my_primary_ids')
    )
json_content_al = (
    query(Data.primary_json.label('primary_json'),
          Data.secondary_json.label('secondary_json'))
    .filter_by(primary_id=my_primary_ids_al.c.primary_id)
    .limit(10)
    .from_self()
    .subquery('json_content')
    .lateral()
    )
joined_query = (
    my_primary_ids_al
    .outerjoin(json_content_al, true())
    .subquery('joined_query')
    )

The joined query, in long form, is the following, with the aforementioned ridiculous nested structure:

SELECT anon_1.primary_id, anon_1.secondary_count
FROM
  (
    SELECT metadata.primary_id AS primary_id, 
           metadata.secondary_count AS secondary_count
    FROM metadata
    WHERE metadata.property = 'whatever I want'                                    
    ORDER BY metadata.secondary_count DESC
    LIMIT :param_1
  ) AS anon_1 
LEFT OUTER JOIN LATERAL 
  (
    SELECT anon_4.anon_3_secondary_json AS anon_3_secondary_json, 
           anon_4.anon_3_primary_json AS anon_3_primary_json, 
    FROM 
      (
        SELECT anon_3.secondary_json AS anon_3_secondary_json, 
               anon_3.primary_json AS anon_3_primary_json,
        FROM 
          (
             SELECT data.secondary_json AS secondary_json, 
                    data.primary_json AS primary_json,
             FROM data 
             JOIN
               (
                  SELECT anon_1.primary_id AS primary_id,
                         anon_1.secondary_count AS secondary_count 
                  FROM 
                    (
                      SELECT metadata.primary_id AS primary_id,
                             metadata.secondary_count AS secondary_count
                      FROM metadata
                      WHERE metadata.property = 'whatever I want'
                      ORDER BY metadata.secondary_count DESC
                      LIMIT :param_1
                   ) AS anon_1
                 ) AS primary_ids ON data.primary_id = primary_ides.primary_id
             ) AS anon_3
           LIMIT :param_2) AS anon_4) AS anon_2 ON true

Again, I realize this is an incomplete attempt, in that not all the columns are SELECTed in the beginning, but the key issue is the sqlalchemy is creating an absurd amount of nested queries in the lateral join sub-query. This is the core issue that I have not been able to resolve, and unless it is resolved, there is little point in finishing the rest of the query.

P. Greene
  • 36
  • 1
  • 3

1 Answers1

0

You do not need both from_self() and subquery(), and the former is in this case messing with auto-correlation and causing the wildly recursive query, because the compiler treats the references to the 1st subquery in- and outside of the second as separate entities. Simply remove the calls to from_self() and the query will be what you are after.

What happens is that when calling from_self() a new Query that selects from the former Query's SELECT statement is created. Applying subquery() then creates a subquery from that, giving 2 levels of nesting. Of course that subquery must be used in yet another query, so there will be 3 levels of nesting, at least. And when the auto-correlation fails and the subquery is included in the second one as is, you get your deeply nested query.

Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127
  • You're right! I had added the `from_self()` calls when trying to merge the queries as queries, not sub-queries, and then didn't remove them when I transitioned back to working with sub-queries. That solved my nesting problem. With the help of a coworker, I was also able to get a complete solution, including the selection of the correct columns from the joined query (and some other features that weren't relevant here). I will let him post that when he gets back from vacation though, as he should get the credit for his part. However he hadn't identified the cause of the nesting. Thanks! – P. Greene Aug 31 '18 at 14:08