0

I am trying to populate ElasticSearch with a collection of documents along with a field representing the path to the document based on its parents.

Here is my table layout:

+----+--------+-------+----------+
| Id | Parent | Alias | Contents |
+----+--------+-------+----------+
| 1  | null   | Doc1  | Admin    |
| 2  | 1      | Doc2  | Use      |
| 3  | 2      | Doc3  | Test     |
| 4  | 3      | Doc4  | Ask      |
| 5  | null   | PDF1  | Intro    |
| 6  | 5      | PDF2  | Managers |
+----+--------+-------+----------+

Here is the desired output

+----+--------+-------+----------+---------------------+
| Id | Parent | Alias | Contents | Path                |
+----+--------+-------+----------+---------------------+
| 1  | null   | Doc1  | Admin    | Doc1                |
| 2  | 1      | Doc2  | Use      | Doc1\Doc2           | 
| 3  | 2      | Doc3  | Test     | Doc1\Doc2\Doc3      |
| 4  | 3      | Doc4  | Ask      | Doc1\Doc2\Doc3\Doc4 |
| 5  | null   | PDF1  | Intro    | PDF1                |
| 6  | 5      | PDF2  | Managers | PDF1\PDF2           |
+----+--------+-------+----------+---------------------+

I have this query that gets the Path of one document specified by the parameter @child; (aka SET @child = 5; )

SELECT 
    T2.*
FROM
    (SELECT 
        @r AS _id,
            (SELECT 
                    @r:=Parent
                FROM
                    documents
                WHERE
                    id = _id) AS ParentId,
            @l:=@l + 1 AS lvl
    FROM
        (SELECT @r:=@child, @l:=@parent) vars, documents
    WHERE
        @r <> 0) T1
        JOIN
    documents T2 ON T1._id = T2.Id
ORDER BY T2.Parent

The problem being is how do I set @child if I put this into a subquery? I have tried GROUP_CONCAT() but it always ends up being the same path for every line. I have tried putting the Id of the current row in subquery but it throws an error: ErrorCode: 1109. Unknown table 'doc' in field list in the following query

SELECT doc.*, (
    SELECT GROUP_CONCAT(a.Alias) FROM (SELECT 
        T2.*
    FROM
        (SELECT 
            @r AS _id,
                (SELECT 
                        @r:=Parent
                    FROM
                        documents
                    WHERE
                        id = _id) AS ParentId,
                @l:=@l + 1 AS lvl
        FROM
            (SELECT @r:= doc.Id, @l:=@parent) vars, documents
        WHERE
            @r <> 0) T1
            JOIN
        documents T2 ON T1._id = T2.Id
    ORDER BY T1.lvl DESC) a
) as Path FROM documents doc

What am I doing wrong? Is there a better way to do this that I'm not seeing?

Though it is not entirely relevant, I will point out, I'm using a logstash script to load the documents into ElasticSearch from my database on a schedule. Also for multiplicities sake I have taken out the majority of the columns as well as the contents and replaced with faux contents.

Cardboard
  • 25
  • 4

2 Answers2

0

I have created a decent solution. Its not incredibly fast, but that is too be expected and as this is just for a once a day load, it is acceptable for now.

Essentially, I created a function that gets the Path based on an id, then just run a view (going with a faux materialized view when pushing to production for faster loads to logstash (avoiding the timeouts essentially)) that selects all of the values, and then the path for the appropriate row.

CREATE FUNCTION `get_parent_path` (child int)
RETURNS VARCHAR(1024)
BEGIN
    DECLARE path varchar(1024);
    SELECT 
        GROUP_CONCAT(a.Alias)
    INTO
        path
    FROM (
        SELECT 
            T2.*
        FROM
            (
                SELECT 
                    @r AS _id
                    (
                        SELECT 
                            @r := Parent
                        FROM
                            documents
                        WHERE
                            id = _id
                    ) as ParentId,
                    @l: = @l + 1 as lvl
                FROM
                    (SELECT @r := child, @l := @parent) vars, documents
                WHERE
                    @r <> 0
                ) T1
        JOIN
            documents T2
        ON
            T1._id = T2.Id
        ORDER BY T2.Id
    ) a;

RETURN COALESCE(path, 'invalid child');
END

Then the view I created the view:

CREATE VIEW 'documentswithpath' AS
SELECT *, get_parent_path(Id) FROM documents;

Then I just run SELECT * FROM documentswithpath; from the logstash script. This is also excluding alot of the logic for logstash for a simplistic answer. If anyone has a better, preferably faster, method of doing this, please let me know! Thanks.

Cardboard
  • 25
  • 4
0

You get your error because you cannot use an outer variable in a derived table. A derived table is basically every "subquery" for which you have to use an alias, like vars in your case. Try removing that alias, and MySQL will tell you that every derived table has to have an alias.

One way to solve this is to move your whole query into a function, e.g. getpath(child_id int), where you can then freely use this variable whereever you want (assuming you have a working query that can get the path for one specific child, "something with GROUP_CONCAT()").

But in your case, it is actually possible to reorganize your code so you do not need a derived table:

select d.*, t3.path 
from (
  SELECT t1.id, 
    group_concat(t2.alias order by t1.rownum desc separator '\\' ) as path
  from (
    SELECT 
      current_child.id, 
      lvls.rownum, 
      @r := if(lvls.rownum = 1, current_child.id, @r) AS _id,
      (SELECT @r:=Parent
       FROM documents
       WHERE id = _id) AS ParentId
    FROM (select @rownum:= @rownum+1 as rownum 
       from documents, -- maybe add limit 5
       (select @rownum := 0) vars
      ) as lvls 
      -- or use: 
      -- (select 1 as rownum union select 2 union select 3 
      -- union select 4 union select 5) as lvls
      straight_join documents as current_child 
  ) as t1
  join documents t2
  on t2.id = t1._id
  group by t1.id
) t3
join documents d
on d.id = t3.id;

I used your inner documents the same way as you did, which is actually quite inefficient and is only used to support an unlimited tree depth. If you know your max dependency level, you could use the alternative code for lvls I added as a comment (which is just a list of numbers) or the limit.

Make sure to set the group_concat_max_len-setting to an appropriate value (with e.g. set session group_concat_max_len = 20000;). By default, it supports a length of 1024, which will usually be enough, but for long aliases or really deep trees you might reach it - and since it will give you neither an error nor a warning, it is sometimes hard to diagnose, so be aware of it.

There is a more straight forward way to solve your problem. It requires you to know the maximum depth of your tree though, but if you do, you can simply join your parents to every child.

select child.*, 
  concat_ws('\\',p4.Alias,p3.Alias,p2.Alias,p1.Alias,child.Alias) as path
from documents child
left join documents p1 on child.parent = p1.id
left join documents p2 on p1.parent = p2.id
left join documents p3 on p2.parent = p3.id
left join documents p4 on p3.parent = p4.id;

Generally speaking, the tree you used for your hierarchy does not work very well in sql because of the recursive nature of the model (even if other databases actually support recursive queries in a very similar way you simulated with the variables).

For other ways to model your hierarchy, see e.g. Bill Karwins presentation Models for hierarchical data. They make it a lot easier to query a path without recursion.

Solarflare
  • 10,721
  • 2
  • 18
  • 35
  • I don't know the maximum depth, I also cannot change the structure right now (time constraint) but I found the same you have with the function, I have also implemented it. I will accept your answer as it is more complete than my own! Thanks! (edit:) this was also originally designed for use with Oracle systems (that do support recursive queries) but with limited server space and funds we moved to mySQL for the majority of our systems! – Cardboard Aug 07 '17 at 10:07
  • Haha, yes, it seems you had the same idea. If your depth is really unlimited, make sure `group_concat` can support the concatenated string. I added a remark about that in the answer. – Solarflare Aug 07 '17 at 10:27
  • There is a max length on my "Alias" field of 15, which means I can have just under 69 (ha, I'm mature) levels before running into the issue (with the default of 1024 chars) if every alias is exactly 15 characters long. I don't believe we have anywhere close to 20 layers, let alone 69. Haha! – Cardboard Aug 07 '17 at 10:48