SQLite - Aggregating related data (tree-like) between 2 tables with multiple potential intermediate relationships

Question

I have an SQLite database (v. 3.8.1) with somewhat unusual schema that can't be changed.

For the purposes of this questions, there are 5 tables (t1 through t5), and I need to create a summary report using data from t1 and t5, but the data I need to reference in t5 can only be gleaned based on relationships to records in t1 through t4.

To help clarify - imagine that t1 holds data regarding a document. The document can subsequently go through 1 to 4 more iterations (with different fields available in each iteration, hence the 5 different tables rather than just a flag in 1 table to signify what iteration it is at).

I'm interested in whether or not an initial record/document (held in t1) has reached it's final iteration or not (a ParentGUID exists in t5 that when followed up the chain of tables, eventually reaches t1, or not).

t1 has a GUID (text) field, and t2 through t5 have GUID and ParentGUID fields (also text). The ParentGUID field in t2 through t5 don't have to be populated (documentation iterations can be skipped in some cases), but when ParentGUID has a value it will always be a GUID from a previous table (for example, if t5 has a ParentGuid value, it will be a GUID from t1, t2, t3 OR t4).

This means that I want all of the distinct records from t1, and then for each a value (or values) from t5 if present, or null if not.

If a ParentGuid field value in a t5 record is the GUID of a record in t4, and the ParentGuid field value in that t4 record is the GUID of a record in t1, then that particular t1 record is considered to have reached its final iteration.

Similarly, ParentGUID > GUID links that will be considered t1 > t5, initial > final iterations include:

t1 > t2 > t3 > t4 > t5
t1 > t2 > t3 > t5
t1 > t2 > t4 > t5
t1 > t2 > t5
t1 > t3 > t4 > t5
t1 > t3 > t5
t1 > t4 > t5
t1 > t5

Or represented graphically:

Possible relationship paths from T1 to T5

Consider the following test schema:

CREATE TABLE Table1
    ("GUID" TEXT, "Name" TEXT)
;

CREATE TABLE Table2
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table3
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table4
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table5
    ("GUID" TEXT, "Name" TEXT, "Amount" REAL, "ParentGUID" TEXT)
;

INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('ABC', 'A1')
;


INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('DEF', 'A2')
;

INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('GHI', 'A3')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('JKL', 'GHI')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('MNO', '')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('PQR', 'GHI')
;

INSERT INTO Table3
    ("GUID", "ParentGUID")
VALUES
    ('STU', 'MNO')
;

INSERT INTO Table3
    ("GUID",  "ParentGUID")
VALUES
    ('STU', 'GHI')
;

INSERT INTO Table3
    ("GUID", "ParentGUID")
VALUES
    ('VWX', 'PQR')
;


INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('YZA', 'VWX')
;

INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('BCD', '')
;

INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('EFG', 'GHI')
;

INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('HIJ', 'EFG', -500, 'E3')
;


INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('KLM', 'YZA', -702, 'E2')
;


INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('NOP', '', 220, 'E8')
;

INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('QRS', 'GHI', 601, 'E4')
;

What I'd like to do is get all records in t1, and then show the total of all related Amount fields from t5 (related in any of the ways listed above), and the group_concat of all the related Name fields from t5.

Using the above sample schema, it would look something like:

t1.Name   total(t5.Amount)   group_concat(t5.Name)
--------------------------------------------------
A1                   0.00  
A2                   0.00  
A3                -601.00    E2,E3,E4

I've tried a bunch of different joins but nothing has worked...either I get too many items in my Total/Group_Concat cells (too high a total due to adding items multiple times, and multiple repeat names such as "E4,E4,E4,E4,E2,E3,E3,E4,E4..."), or I can only get the one directly linked item to t1 from t5 (601.00, E4).

For example, the query just gives me the E4/601.00 result for t1 record GHI:

SELECT DISTINCT t1.guid "OriginalGuid", t1.name "OriginalName", TOTAL(t5."Amount") as "TotalAmount", group_concat(t5.Name) AS "FinalNames"
FROM 
Table1 t1
LEFT  JOIN Table5 t5 ON (t1.GUID=t5.ParentGUID)
LEFT  JOIN Table4 t4 ON (t1.GUID=t4.ParentGuid AND t5.ParentGuid=t4.Guid)
LEFT  JOIN Table3 t3 ON (t1.GUID=t3.ParentGuid AND (t4.ParentGuid=t3.Guid OR t5.ParentGuid=t3.Guid))
LEFT  JOIN Table2 t2 ON (t1.GUID=t2.ParentGuid AND (t3.ParentGuid=t2.Guid AND ((t4.ParentGuid=t3.Guid And t5.ParentGuid=t4.guid) or (t5.ParentGuid=t3.Guid)) OR (t4.ParentGuid=t2.Guid and t5.ParentGuid=t4.Guid) OR (t5.ParentGuid=t2.Guid)))
GROUP BY t1.GUID;

Sadly, I've been working on this for a good chunk of the weekend, and I haven't been able to figure out something that works and is reasonably performant (I had something that appeared to work okay for small datasets but took minutes against my full dataset, which is just too long - I've since lost the SQL for that though unfortunately).

I'm continuing to work on a solution now, and if I find it I will post the answer here, but I'd greatly appreciate any assistance/ideas if they're out there!

Here's my SQL Fiddle: http://sqlfiddle.com/#!5/1a2ac/55

Thanks in advance for any help.

your SQL fiddle data has -500 for the E3 entry in table 5, not -50 as in example above. — simo.3792, Aug 25 '14 at 01:14

Jaugar Chang · Accepted Answer · 2014-08-25T06:25:41.473

cha's answer was ok, but could be optimized by adding a temp table to store all the relations from table2 to table5.

CREATE TABLE TableRel
    ("GUID" TEXT, "ParentGUID" TEXT, "TB" TEXT);

insert into TableRel
select GUID, ParentGUID, 'TABLE2'
FROM TABLE2
UNION ALL
select GUID, ParentGUID, 'TABLE3'
FROM TABLE3
UNION ALL
select GUID, ParentGUID, 'TABLE4'
FROM TABLE4
UNION ALL
select GUID, ParentGUID, 'TABLE5'
FROM TABLE5
;

UPDATE

Then you could use recursive query to get all descendants from table1.

WITH RECURSIVE Table1Descendants(GUID, DescendantGUID,generation) as (
  select t1.GUID, Rel.GUID ,1
  from Table1 t1
  inner join TableRel rel
  on t1.GUID= Rel.ParentGUID
  UNION ALL
  select td.GUID, Rel.GUID, td.generation+1
  from TableRel Rel
  inner join Table1Descendants td
  on td.DescendantGUID= Rel.ParentGUID
  ) 
select t1.guid , t1.name , coalesce(sum(t5.Amount) ,0)
from Table1 as t1
left join Table1Descendants
on t1.GUID = Table1Descendants.GUID
left join Table5 as t5
on t5.GUID = Table1Descendants.DescendantGUID
group by t1.guid,t1.name
order by t1.name;

Or you could get all ancestors from table5.

WITH RECURSIVE Table1Ancestors(GUID, AncestorGUID) as (
  select t5.GUID, Rel.ParentGUID 
  from Table5 t5
  inner join TableRel rel
  on t5.GUID= Rel.GUID
  UNION ALL
  select ta.GUID, Rel.ParentGUID
  from TableRel Rel
  inner join Table1Ancestors ta
  on ta.AncestorGUID= Rel.GUID
  ) 
select t1.guid , t1.name , coalesce(sum(t5.Amount) ,0)
from Table1 as t1
left join Table1Ancestors
on t1.GUID = Table1Ancestors.AncestorGUID
left join Table5 as t5
on t5.GUID = Table1Ancestors.GUID
group by t1.guid,t1.name
order by t1.name;

But only since 3.8.3 SQLite support recursive CTE, I don't have this version of SQLite, here is the SQLFidle tested with PostgreSQL, they have similar grammar with recursive query, but no total and group_concat functions in PostgreSQL.

And here is a none recursive query(SqlFiddle) in case you don't have SQLite 3.8.3 or later version:

select t1.guid "OriginalGuid", t1.name "OriginalName", TOTAL(t5."Amount") as "TotalAmount", group_concat(t5.Name) AS "FinalNames"
from Table1 as t1
left join
(
  select t1.GUID, Rel.GUID as DescendantGUID, 1
  from Table1 t1
  inner join TableRel rel
  on t1.GUID= Rel.ParentGUID
  UNION ALL
  select t1.GUID, Rel2.GUID, 2
  from Table1 t1
  inner join TableRel rel1
  on t1.GUID= Rel1.ParentGUID
  inner join TableRel rel2
  on Rel1.GUID= Rel2.ParentGUID
  UNION ALL
  select t1.GUID, Rel3.GUID, 3
  from Table1 t1
  inner join TableRel rel1
  on t1.GUID= Rel1.ParentGUID
  inner join TableRel rel2
  on Rel1.GUID= Rel2.ParentGUID
  inner join TableRel rel3
  on Rel2.GUID= Rel3.ParentGUID
  UNION ALL
  select t1.GUID, Rel4.GUID, 4
  from Table1 t1
  inner join TableRel rel1
  on t1.GUID= Rel1.ParentGUID
  inner join TableRel rel2
  on Rel1.GUID= Rel2.ParentGUID
  inner join TableRel rel3
  on Rel2.GUID= Rel3.ParentGUID
  inner join TableRel rel4
  on Rel3.GUID= Rel4.ParentGUID
  ) as Table1Descendants
on t1.GUID = Table1Descendants.GUID
left join Table5 as t5
on t5.GUID = Table1Descendants.DescendantGUID
group by t1.guid,t1.name

Result:

OriginalGuid    OriginalName    TotalAmount FinalNames
ABC             A1              0.0 
DEF             A2              0.0 
GHI             A3              -601.0      E3,E2,E4

This seems overly complex. This would probably work where there is an arbitrary level of hierarchy, but the OP has specific 5 levels and not the ability to change it. — simo.3792, Aug 25 '14 at 07:45
@simo.3792095 So I offered another none recursive version instead. — Jaugar Chang, Aug 25 '14 at 07:55
@simo.3792095 By the way, your answer's logic is right, but may generate many duplicate rows when join with `or`. If data grows, it's hard of tuning the performance. — Jaugar Chang, Aug 25 '14 at 08:04
Thanks Jaugar. I've updated my post to indicate that I currently only have access to SQLite 3.8.1, so unfortunately I won't be able to use the recursive approach even though it looks interesting for a more general approach to this kind of problem where the number of possible related tables might be variable. I'll also be testing your solution against my larger datasets to see how it performs vs. the others. — JPBro, Aug 25 '14 at 13:01
@JaugarChang I may have mentioned elsewhere, I have used lots of SQL, but not SQLite, so it may be slightly different. However, I looked at the execution plan on mine compared to your solution and it is significantly shorter, meaning the SQL Engine can optimise it more efficiently. — simo.3792, Aug 26 '14 at 00:00
@JaugarChang - I've acceptedyour answer since it had the best performance (on my larger dataset it produced the results in about 1 second). Thank you very much for your help! — JPBro, Aug 26 '14 at 20:00

score 1 · Answer 2 · answered Aug 25 '14 at 00:36

This query will do this. Basically, you need to UNION ALL all combinations (likely you have a limited number of possible combinations) and then just LEFT JOIN them to T1 and group_concat the names:

SQL Fiddle

SQLite (SQL.js) Schema Setup:

CREATE TABLE Table1
    ("GUID" TEXT, "Name" TEXT)
;

CREATE TABLE Table2
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table3
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table4
    ("GUID" TEXT, "ParentGUID" TEXT)
;

CREATE TABLE Table5
    ("GUID" TEXT, "Name" TEXT, "Amount" REAL, "ParentGUID" TEXT)
;

INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('ABC', 'A1')
;


INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('DEF', 'A2')
;

INSERT INTO Table1
    ("GUID", "Name")
VALUES
    ('GHI', 'A3')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('JKL', 'GHI')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('MNO', '')
;

INSERT INTO Table2
    ("GUID", "ParentGUID")
VALUES
    ('PQR', 'GHI')
;

INSERT INTO Table3
    ("GUID", "ParentGUID")
VALUES
    ('STU', 'MNO')
;

INSERT INTO Table3
    ("GUID",  "ParentGUID")
VALUES
    ('STU', 'GHI')
;

INSERT INTO Table3
    ("GUID", "ParentGUID")
VALUES
    ('VWX', 'PQR')
;


INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('YZA', 'VWX')
;

INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('BCD', '')
;

INSERT INTO Table4
    ("GUID", "ParentGUID")
VALUES
    ('EFG', 'GHI')
;

INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('HIJ', 'EFG', -500, 'E3')
;


INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('KLM', 'YZA', -702, 'E2')
;


INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('NOP', '', 220, 'E8')
;

INSERT INTO Table5
    ("GUID", "ParentGUID", "Amount", "Name" )
VALUES
    ('QRS', 'GHI', 601, 'E4')
;

Query 1:

SELECT t1.GUID, group_concat(o.Name), COALESCE(SUM(o.Amount), 0.0) TotalAmount
FROM Table1 t1 LEFT JOIN
(
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table5 t5 ON (t1.GUID=t5.ParentGUID)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table4 t4 ON (t1.GUID=t4.ParentGuid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t4.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table3 t3 ON (t1.GUID=t3.ParentGuid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t3.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table3 t3 ON (t1.GUID=t3.ParentGuid)
INNER JOIN Table4 t4 ON (t4.ParentGuid=t3.Guid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t4.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table2 t2 ON (t1.GUID=t2.ParentGuid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t2.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table2 t2 ON (t1.GUID=t2.ParentGuid)
INNER JOIN Table4 t4 ON (t4.ParentGuid=t2.Guid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t4.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table2 t2 ON (t1.GUID=t2.ParentGuid)
INNER JOIN Table3 t3 ON (t3.ParentGuid=t2.Guid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t3.Guid)
UNION ALL
SELECT t1.GUID, t5.Name, t5.Amount
FROM 
Table1 t1
INNER JOIN Table2 t2 ON (t1.GUID=t2.ParentGuid)
INNER JOIN Table3 t3 ON (t3.ParentGuid=t2.Guid)
INNER JOIN Table4 t4 ON (t4.ParentGuid=t3.Guid)
INNER JOIN Table5 t5 ON (t5.ParentGuid=t4.Guid)
) o ON t1.GUID = o.GUID
GROUP BY t1.GUID

Results:

| GUID | group_concat(o.Name) | TotalAmount |
|------|----------------------|-------------|
|  ABC |                      |         0.0 |
|  DEF |                      |         0.0 |
|  GHI |             E2,E3,E4 |      -601.0 |

Thanks a lot cha! It's easy to see how this solution works by unioning all the possibily linkages, and I am just testing it against my larger data sets for performance to see how it does. I appreciate the help. — JPBro, Aug 25 '14 at 12:54

simo.3792 · Answer 3 · 2014-08-25T02:37:18.403

From what I could see your DISTINCT needs to be applied before you do the TOTAL and CONCAT. These functions were being applied to all rows, and then the DISTINCT applied after. Because there are some many paths to get from t1 to t5, these needed to be excluded in a subquery first.

SELECT sq1.guid "OriginalGuid", sq1.name "OriginalName", TOTAL(sq1."Amount") as "TotalAmount", group_concat(sq1.FinalNames) AS "FinalNames"
FROM
(SELECT DISTINCT t1.guid, t1.name, t5."Amount", t5.Name AS "FinalNames"
  FROM Table1 t1
  LEFT JOIN  Table2 t2 ON (t2.ParentGUID = t1.GUID)
  LEFT JOIN  Table3 t3 ON (t3.ParentGUID = t2.GUID
                           OR (t3.ParentGUID = t1.GUID))
  LEFT JOIN  Table4 t4 ON (t4.ParentGUID = t3.GUID 
                           OR (t4.ParentGUID = t2.GUID) 
                           OR (t4.ParentGUID = t1.GUID))
  LEFT JOIN  Table5 t5 ON (t5.ParentGUID = t4.GUID 
                           OR (t5.ParentGUID = t3.GUID) 
                           OR (t5.ParentGUID = t2.GUID) 
                           OR (t5.ParentGUID = t1.GUID))) sq1
GROUP BY sq1.guid;

As for performance question. I am not familiar with the SQLite field types, but in SQL Server the text type, is a variable object that can store up to 2GB, and cannot be used on indexes. It seems in SQLite, that this is different, but maybe suggest defining this as VARCHAR, if you want to port it to another SQL engine at any stage.

Without knowing your full structure, I would guess that each table has the GUID defined as the primary key, and an index exists for each table's ParentGUID. The JOINS above should not be inherently slow for any particular reason, if keys and indexes are defined correctly.

Thanks Simo! I like this solution due to its compactness. I'm testing it against some of my larger data sets to see if it works in all cases and has good performance. — JPBro, Aug 25 '14 at 12:51
BTW, SQLite just has TEXT, REAL, INTEGER, BLOB, and NULL "storage classes" similar to data types. More information here if you are interested: http://www.sqlite.org/datatype3.html — JPBro, Aug 25 '14 at 12:53

SQLite - Aggregating related data (tree-like) between 2 tables with multiple potential intermediate relationships

3 Answers3