0

I'm searching for an Impala Query idea.

Let me try to explain my problem: it is all about sorting IDs. I have a table with a different types of IDs. A head ID and a kind of sub IDs (for one head ID there are up to 150 sub IDs)

Through a window function (ROW_NUMBER() OVER (PARTITION BY)) it's no problem to sort them. The main problem is, they have a specific order which is stored in a second table.

The second table contains every Sub_ID and which ID comes before and which after.

I managed it to sort these partitions and also identify the first ID, but I have no clue how to sort by the other table.

Let's try to show you an example:

table 1

head_ID sub_ID
1        001
1        002
1        003
2        011
2        012
2        013
2        014

table 2

sub_ID begin_ID end_ID
002     003      001
012     011      0013

I hope you get the idea

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • Do you have to have the second table in that format? Can it be changed?. Also, what do the first and last entries of the chain look like? what is the begin_id of the sub that has no prior entry? And how often do the sub_ID orderings change? – DancingFool Dec 05 '18 at 01:14

1 Answers1

0

I think you're going to have to do this in two steps - first sort out the sort order, and then second do your actual query. I suspect that getting the sort order could be expensive (I can't figure out a way to do it that's not looping or recursive), so if possible you should avoid doing it more often than you have to. If your table2 doesn't change very often, and you can change the design, I'd be tempted to store the actual sort order in that table in an extra column.

Any time table2 got modified, you would have to update the sort order before you ran this query again. Because to update the sort order would probably require multiple table2 edits, and the order would actually be broken until the last one, I would probably put a trigger on the table to set a flag that the order needs updating. You can check the flag before this query is run, or in a nightly maintenance run (whichever comes first) and update the order if required.

If you can't change the DB, you can do the sorting before each query run, but depending on your data it could slow things down to much.

Anyway, to figure out the sort order, you need to create an orderno for each row in table2 (either updating the rows directly, or in a separate temp table. It looks like you have a number of order lists in the data, one for each headerID. You can create the order by first finding the beginning row for each chain (I'll assume begin_ID is null), and give them orderNo 1. Then in a loop, find the rows that should be the next orderno, and allocate that to them, until you don't find any more. If you have any rows at the end that don't have an orderNo allocated, then you have data problems.

--Set up the work table
DECLARE @Work TABLE (Sub_ID int, orderNo int);

-- Set up the start of each order list
insert into @Work (sub_ID, orderNo)
Select sub_ID, 1
from table2 
where begin_ID is null;

DECLARE @Finished int = 0;  --Flag to see if we're done
DECLARE @NextOrder int = 2; --Next order number to process

While @Finished = 0
BEGIN
    -- add the next level for all the order lists
    insert into @Work (sub_ID, orderNo)
    Select t2.sub_ID, @NextOrder
    From table2 t2
        inner join @Work w on w.sub_ID = t2.begin_ID       -- We want rows that are next in an order chain
        left outer join @Work w2 on w2.sub_ID = t2.sub_ID  -- and haven't already been done (to avoid loops)
    Where w2.Sub_ID is null;

    IF @@ROWCOUNT = 0 SET @Finished = 1;  --flag if nothing was updated (so stop)
    SET @NextOrder = @NextOrder + 1;  -- next order level to add
END;

--example usage of order table.  Note that any records where w.sub_id is null means
-- that record was not in a reachable order list (either the table2 record does not
-- exist, or the order list walk never reached it).
Select t1.*
from table1 t1
    left outer join @Work w on w.sub_ID = t1.sub_id
order by T1.head_id, w.orderNo

You could also do this by using a recursive CTE, but the theory would be the same. This would let you do it all in one query, if IMPALA needs to work that way.

DancingFool
  • 1,247
  • 1
  • 8
  • 10