
I would like to create a linear ancestry listing for a tree breeding project. The parents are male/female pairs that must not be related (no inbreeding), so it is important to track and visualize these pedigrees.

Below are the test tables and data, using PostgreSQL 9.1:

DROP TABLE if exists family CASCADE;
DROP TABLE if exists plant CASCADE;

CREATE TABLE family (   
  id serial,
  family_key VARCHAR(20) UNIQUE,
  female_plant_id INTEGER NOT NULL DEFAULT 1,  
  male_plant_id INTEGER NOT NULL DEFAULT 1,   
  filial_n INTEGER NOT NULL DEFAULT -1,  -- eg 0,1,2...  Which would represent None, F1, F2... 
  CONSTRAINT family_pk PRIMARY KEY (id)
);

CREATE TABLE plant ( 
  id serial,
  plant_key VARCHAR(20) UNIQUE,
  id_family INTEGER NOT NULL,  
  CONSTRAINT plant_pk PRIMARY KEY (id),
  CONSTRAINT plant_id_family_fk FOREIGN KEY(id_family) REFERENCES family(id) -- temp may need to remove constraint...
);

-- FAMILY Table DATA:
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (1,'NA',1,1,1); -- Default place holder record
-- Root level Alba families
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (2,'family1AA',2,3,1);
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (3,'family2AA',4,5,1);
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (4,'family3AA',6,7,1);
-- F2 Hybrid Families
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (5,'family4AE',8,11,0); 
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (6,'family5AG',9,12,0);
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (7,'family6AT',10,13,0); 
-- F3 Double Hybrid family:
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (9,'family7AEAG',14,15,0);
-- F3 Tri-hybrid backcross family:
insert into family (id, family_key, female_plant_id, male_plant_id, filial_n) VALUES (10,'family8AEAGAT',17,16,0);

-- PLANT Table DATA:
-- Root level Alba Parents: 
insert into plant (id, plant_key,  id_family) VALUES (1,'NA',1);      -- Default place holder record
insert into plant (id, plant_key,  id_family) VALUES (2,'female1A',1); 
insert into plant (id, plant_key,  id_family) VALUES (3,'male1A',1);
insert into plant (id, plant_key,  id_family) VALUES (4,'female2A',1);
insert into plant (id, plant_key,  id_family) VALUES (5,'male2A',1);
insert into plant (id, plant_key,  id_family) VALUES (6,'female3A',1); 
insert into plant (id, plant_key,  id_family) VALUES (7,'male3A',1);
-- Female Alba progeny:
insert into plant (id, plant_key,  id_family) VALUES (8,'female4A',2);
insert into plant (id, plant_key,  id_family) VALUES (9,'female5A',3);
insert into plant (id, plant_key,  id_family) VALUES (10,'female6A',4);
-- Male Aspen Root level parents:
insert into plant (id, plant_key,  id_family) VALUES (11,'male1E',1); 
insert into plant (id, plant_key,  id_family) VALUES (12,'male1G',1);  
insert into plant (id, plant_key,  id_family) VALUES (13,'female1T',1);
-- F1 Hybrid progeny:
insert into plant (id, plant_key,  id_family) VALUES (14,'female1AE',5); 
insert into plant (id, plant_key,  id_family) VALUES (15,'male1AG',6);  
insert into plant (id, plant_key,  id_family) VALUES (16,'male1AT',7);
-- Hybrid progeny
insert into plant (id, plant_key,  id_family) VALUES (17,'female1AEAG',9);
-- Tri-hybrid backcross progeny:
insert into plant (id, plant_key,  id_family) VALUES (18,'female1AEAGAT',10);
insert into plant (id, plant_key,  id_family) VALUES (19,'female2AEAGAT',10);

Below is the recursive query that I derived from the PostgreSQL WITH Queries documentation:

WITH RECURSIVE search_tree(
      family_key
    , female_plant
    , male_plant
    , depth
    , path
    , cycle
) AS (
    SELECT 
          f.family_key
        , pf.plant_key
        , pm.plant_key
        , 1
        , ARRAY[ROW(pf.plant_key, pm.plant_key)]
        , false
    FROM 
          family f
        , plant pf
        , plant pm
    WHERE 
        f.female_plant_id = pf.id
        AND f.male_plant_id = pm.id
        AND f.filial_n = 1 -- Include only F1 families (root level)
        AND f.id <> 1      -- omit the default first family record

    UNION ALL

    SELECT  
          f.family_key
        , pf.plant_key
        , pm.plant_key
        , st.depth + 1
        , path || ROW(pf.plant_key, pm.plant_key)
        , ROW(pf.plant_key, pm.plant_key) = ANY(path)
    FROM 
          family f
        , plant pf
        , plant pm
        , search_tree st
    WHERE 
        f.female_plant_id = pf.id
        AND f.male_plant_id = pm.id
        AND f.family_key = st.family_key
        AND pf.plant_key = st.female_plant
        AND pm.plant_key = st.male_plant
        AND f.filial_n <> 1  -- Include only non-F1 families (non-root levels)
        AND NOT cycle
)
SELECT * FROM search_tree;

Below is the desired output:

F1 family1AA=(female1A x male1A) > F2 family4AE=(female4A x male1E) > F3 family7AEAG=(female1AE x male1AG) > F4 family8AEAGAT=(female1AEAG x male1AT)  
F1 family2AA=(female2A x male2A) > F2 family5AG=(female5A x male1G) > F3 family7AEAG=(female1AE x male1AG) > F4 family8AEAGAT=(female1AEAG x male1AT) 
F1 family3AA=(female3A x male3A) > F2 family6AT=(female6A x female1T) > F3 family8AEAGAT=(female1AEAG x male1AT) 

The above Recursive query displays 3 rows with the appropriate F1 parents but the path does not display the downstream families/parents. I would appreciate help to make the recursive output similar to the desired output listed above.

A.H.
user1888167
  • Nice question; very well put. very complete. I'm working on it... – wildplasser Dec 08 '12 at 20:08
  • I'm not sure I understand how the hierarchy is defined. I can't find a parent/child relationship in the example tables. Can you explain a bit on how the parent (or the child) is found? –  Dec 09 '12 at 12:58
  • Is it possible that the row with `plant.id = 11` should have `2` as the `family_id`? –  Dec 09 '12 at 13:04
  • The parent/child hierarchy is defined by the family/plant ids. Each plant "links" to the family table via the id_family foreign key. Note that plants with Root level parents have an ID of 1 which maps to 'NA' (aka: Not Applicable). Therefore a root level family is one that has BOTH plant parents with a family ID of 1 (NA). Conversely the family table has female_plant_id and male_plant_id columns for the respective plant IDs. Regarding plant.id(11), it is correct since it is a ROOT level plant and is correctly mapped to family_id of 1 (NA). Start with family ID=2 and build the trees. – user1888167 Dec 09 '12 at 14:14
  • I should add that the family filial_n rows = 1 for each root level family. Assume that this was populated from an update statement using the logic that BoTH plant parents are at root level. Later I will add an update statement to add the other filial numbers. – user1888167 Dec 09 '12 at 14:26
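
The hierarchy described in these comments can be sketched in SQL. This is only an illustration against the test tables above, not part of the original question: the first statement lists each parent family next to the families bred from its plants, and the second is one possible form of the update the asker mentions for marking root-level families (both parents in family 1, the 'NA' placeholder). The column aliases are made up for the example.

```sql
-- Parent family -> child family links, derived from plant.id_family.
-- A family is a child of another family when one of its two parent
-- plants belongs to that other family.
SELECT
    parent.family_key AS parent_family,   -- alias chosen for illustration
    child.family_key  AS child_family,    -- alias chosen for illustration
    p.plant_key       AS linking_plant
FROM family child
    JOIN plant p
        ON p.id IN (child.female_plant_id, child.male_plant_id)
    JOIN family parent
        ON parent.id = p.id_family
WHERE parent.id <> 1                      -- skip the 'NA' placeholder family
ORDER BY parent.id, child.id;

-- Assumed form of the "root level" update: a family is root level
-- when BOTH of its parent plants belong to family 1 ('NA').
UPDATE family f
SET filial_n = 1
FROM plant pf, plant pm
WHERE f.female_plant_id = pf.id
  AND f.male_plant_id = pm.id
  AND pf.id_family = 1
  AND pm.id_family = 1
  AND f.id <> 1;                          -- leave the placeholder row alone
```

With the test data, the update matches only family1AA, family2AA and family3AA, which already carry filial_n = 1.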

1 Answer


I have adapted the query to what I have understood, not necessarily to what is required :-)

The query starts at the three given families defined by f.id != 1 AND f.filial_n = 1 and recursively expands available children.

I do not understand on what condition only the last three matches should be selected. Perhaps the longest chain of ancestors for each starting family?

WITH RECURSIVE expanded_family AS (
    SELECT
        f.id,
        f.family_key,
        pf.id           pf_id,
        pf.plant_key    pf_key,
        pf.id_family    pf_family,
        pm.id           pm_id,
        pm.plant_key    pm_key,
        pm.id_family    pm_family,
        f.filial_n
    FROM family f
        JOIN plant pf ON f.female_plant_id = pf.id
        JOIN plant pm ON f.male_plant_id = pm.id
),
search_tree AS (
    SELECT
        f.*,
        1 depth,
        ARRAY[f.family_key::text] path
    FROM expanded_family f
    WHERE
        f.id != 1
        AND f.filial_n = 1
    UNION ALL
    SELECT
        f.*,
        depth + 1,
        path || f.family_key::text
    FROM search_tree st
        JOIN expanded_family f
            ON f.pf_family = st.id
            OR f.pm_family = st.id
    WHERE
        f.id <> 1
)
SELECT
    family_key,
    depth,
    path
FROM search_tree;

The result is:

  family_key   | depth |                      path                       
---------------+-------+-------------------------------------------------
 family1AA     |     1 | {family1AA}
 family2AA     |     1 | {family2AA}
 family3AA     |     1 | {family3AA}
 family4AE     |     2 | {family1AA,family4AE}
 family5AG     |     2 | {family2AA,family5AG}
 family6AT     |     2 | {family3AA,family6AT}
 family7AEAG   |     3 | {family1AA,family4AE,family7AEAG}
 family7AEAG   |     3 | {family2AA,family5AG,family7AEAG}
 family8AEAGAT |     3 | {family3AA,family6AT,family8AEAGAT}
 family8AEAGAT |     4 | {family1AA,family4AE,family7AEAG,family8AEAGAT}
 family8AEAGAT |     4 | {family2AA,family5AG,family7AEAG,family8AEAGAT}

Technical stuff:

  • I have removed the cycle stuff because for clean data it should not be necessary (IMHO).

  • expanded_family can be inlined if some odd performance problem occurs, but for now it makes the recursive query more readable.
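
If the data cannot be trusted to be free of cycles, a guard can be put back without the ROW comparison from the question. The following is a minimal sketch under that assumption, not the answer's original query: it reuses the text[] path that is already being built and refuses to revisit a family that is already on it. Only the last condition in the recursive WHERE is new.

```sql
WITH RECURSIVE expanded_family AS (
    SELECT
        f.id,
        f.family_key,
        pf.id_family    pf_family,
        pm.id_family    pm_family,
        f.filial_n
    FROM family f
        JOIN plant pf ON f.female_plant_id = pf.id
        JOIN plant pm ON f.male_plant_id = pm.id
),
search_tree AS (
    SELECT
        f.id,
        f.family_key,
        1 depth,
        ARRAY[f.family_key::text] path
    FROM expanded_family f
    WHERE
        f.id <> 1
        AND f.filial_n = 1
    UNION ALL
    SELECT
        f.id,
        f.family_key,
        st.depth + 1,
        st.path || f.family_key::text
    FROM search_tree st
        JOIN expanded_family f
            ON f.pf_family = st.id
            OR f.pm_family = st.id
    WHERE
        f.id <> 1
        -- cycle guard: stop if this family is already on the path
        AND f.family_key::text <> ALL (st.path)
)
SELECT
    family_key,
    depth,
    path
FROM search_tree;
```

For the clean test data the guard never fires, so the output should be the same as shown above.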

EDIT

A slight modification of the query keeps only the rows with the longest path for each "root" family (i.e. the families from which the query started).

I show only the changed part in search_tree, so you have to copy the head from the previous section:

-- ...
search_tree AS
(
    SELECT
        f.*,
        f.id            family_root,   -- remember where the row came from.
        1 depth,
        ARRAY[f.family_key::text] path
    FROM expanded_family f
    WHERE
        f.id != 1
        AND f.filial_n = 1
    UNION ALL
    SELECT
        f.*,
        st.family_root,    -- propagate the root ancestor
        depth + 1,
        path || f.family_key::text
    FROM search_tree st
        JOIN expanded_family f
            ON f.pf_family = st.id
            OR f.pm_family = st.id
    WHERE
        f.id <> 1
)
SELECT
    family_key,
    path
FROM
(
    SELECT
        rank() over (partition by family_root order by depth desc),
        family_root,
        family_key,
        depth,
        path
    FROM search_tree
) AS ranked
WHERE rank = 1;

The result is:

  family_key   |                      path                       
---------------+-------------------------------------------------
 family8AEAGAT | {family1AA,family4AE,family7AEAG,family8AEAGAT}
 family8AEAGAT | {family2AA,family5AG,family7AEAG,family8AEAGAT}
 family8AEAGAT | {family3AA,family6AT,family8AEAGAT}
(3 rows)

EDIT2

Based on the comments I added a pretty_print version of the path:

WITH RECURSIVE expanded_family AS (
    SELECT
        f.id,
        pf.id_family    pf_family,
        pm.id_family    pm_family,
        f.filial_n,
        f.family_key || '=(' || pf.plant_key || ' x ' || pm.plant_key || ')' pretty_print
    FROM family f
        JOIN plant pf ON f.female_plant_id = pf.id
        JOIN plant pm ON f.male_plant_id = pm.id
),
search_tree AS
(
    SELECT
        f.id,
        f.id            family_root,
        1 depth,
        'F1 ' || f.pretty_print  path
    FROM expanded_family f
    WHERE
        f.id != 1
        AND f.filial_n = 1
    UNION ALL
    SELECT
        f.id,
        st.family_root,
        st.depth + 1,
        st.path || ' -> F' || (st.depth + 1) || ' ' || f.pretty_print
    FROM search_tree st
        JOIN expanded_family f
            ON f.pf_family = st.id
            OR f.pm_family = st.id
    WHERE
        f.id <> 1
)
SELECT
    path
FROM
(
    SELECT
        rank() over (partition by family_root order by depth desc),
        path
    FROM search_tree
) AS ranked
WHERE rank = 1;

The result is:

    path                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------
 F1 family1AA=(female1A x male1A) -> F2 family4AE=(female4A x male1E) -> F3 family7AEAG=(female1AE x male1AG) -> F4 family8AEAGAT=(female1AEAG x male1AT)
 F1 family2AA=(female2A x male2A) -> F2 family5AG=(female5A x male1G) -> F3 family7AEAG=(female1AE x male1AG) -> F4 family8AEAGAT=(female1AEAG x male1AT)
 F1 family3AA=(female3A x male3A) -> F2 family6AT=(female6A x female1T) -> F3 family8AEAGAT=(female1AEAG x male1AT)
(3 rows)
A.H.
  • Awesome - I should be able to take it from here! I might use PL/pgsql to remove the duplicate ancestors and add the parent/child formatting. Thanks for your help! You have helped breed better trees!!! – user1888167 Dec 09 '12 at 19:48
  • @user1888167: pl/pgsql is not required. You can add appropriate filters in three places: The `WHERE` of the non-recursive part (where `f.id` and `f.filial_n` are already checked), the recursive `WHERE` and you can also add a filter to the "outer" select. The "outer" `SELECT` is the usual place for stuff like this. To do the filtering you can use more information than the current output shows.
    I just did not know what criteria you want to apply.
    – A.H. Dec 09 '12 at 20:01
  • The most desirable criteria would be to only display the "complete family paths", which would be the last three rows of your output. So yes, for each starting family it would be the longest unique chain of ancestors? Is this possible? – user1888167 Dec 09 '12 at 20:10
  • Yes, the 3 family paths is great! Hmm, I was able to add the female plant names (males too) by adding ARRAY[f.pf_key::text] and f.pf_key::text to the top 2 "path" lines. How would I add the depth and other characters to get 1 family output like: "F1 family1AA=(female1A x male1A)", where F1 would have the appropriate depth number for each family? I was not able to concatenate "1 depth" and "depth + 1" without casting issues that I could not resolve. This output would allow us to search these paths for any plant or family name. Thanks for your patience! – user1888167 Dec 10 '12 at 03:25
  • @user1888167: I added another version with pretty printed path. – A.H. Dec 10 '12 at 19:40
  • The output is excellent! Lastly, I am having issues inserting the path into a 2 column table with an ID and path columns: CREATE TABLE pedigree (id serial, path VARCHAR NOT NULL); I will be done once I stuff the data into a table. Thanks! – user1888167 Dec 11 '12 at 01:21
  • FWIW: I added the family_key to the output (see above), used the psql command to create a .csv file and loaded it into the database with the COPY command. This fit well with my batch refresh process for these tables. Thanks everyone for your help!!! – user1888167 Dec 14 '12 at 03:03
  • UPDATE: I observed missing rows from the above "Edit 2" query. The output was not capturing "repeated" families that used the same parents but had different family names. The fix was to simply comment out the last line, "WHERE rank = 1". This really rocks! – user1888167 Dec 23 '12 at 01:13
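
For the pedigree table mentioned in the comments, an alternative to the .csv and COPY round trip is to feed the "EDIT2" query straight into an INSERT. This is a sketch only, assuming the table definition quoted in the comment (id serial, path VARCHAR NOT NULL); it is not what the asker ended up doing.

```sql
CREATE TABLE pedigree (id serial, path VARCHAR NOT NULL);

-- The id column is filled by its serial default; only path is supplied.
INSERT INTO pedigree (path)
WITH RECURSIVE expanded_family AS (
    SELECT
        f.id,
        pf.id_family    pf_family,
        pm.id_family    pm_family,
        f.filial_n,
        f.family_key || '=(' || pf.plant_key || ' x ' || pm.plant_key || ')' pretty_print
    FROM family f
        JOIN plant pf ON f.female_plant_id = pf.id
        JOIN plant pm ON f.male_plant_id = pm.id
),
search_tree AS (
    SELECT
        f.id,
        f.id            family_root,
        1 depth,
        'F1 ' || f.pretty_print  path
    FROM expanded_family f
    WHERE
        f.id <> 1
        AND f.filial_n = 1
    UNION ALL
    SELECT
        f.id,
        st.family_root,
        st.depth + 1,
        st.path || ' -> F' || (st.depth + 1) || ' ' || f.pretty_print
    FROM search_tree st
        JOIN expanded_family f
            ON f.pf_family = st.id
            OR f.pm_family = st.id
    WHERE
        f.id <> 1
)
SELECT path
FROM (
    SELECT
        rank() over (partition by family_root order by depth desc),
        path
    FROM search_tree
) AS ranked
WHERE rank = 1;   -- drop this filter to store every path, as in the last comment
```

The final filter is the same one discussed in the last comment: with it, only the longest path per root family is stored; without it, every intermediate path is stored as well.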