6

Let's suppose we have a table:

Owner   | Pets
------------------------------
Jack    | "dog, cat, crocodile" 
Mary    | "bear, pig"

I want to get as a result:

Owner   | Pets
------------------------------
Jack    | "dog"
Jack    | "cat"
Jack    | "crocodile" 
Mary    | "bear"
Mary    | "pig"

I found some solutions to similar problems by googling, but Impala SQL does not offer any of these capabilities to apply the suggested solutions.

Any help would be greatly appreciated!

ifotopoulos
  • 83
  • 1
  • 1
  • 3
  • Can you add some details of what you tried and what the results were, so that people don't duplicate your failed efforts? – shoover May 23 '16 at 23:01
  • 2
    What you want to do is "flatten" a nested data structure, and that nested thing violates the whole entity-relation concept. So it cannot be done with **standard** SQL; only with database-specific extensions, which may or may not exist, and will differ widely from database to database. Sorry, Impala was not designed for this kind of ugly ETL stuff; you are supposed to prepare the data with Pig or Hive or whatever, load it in a Hive table, then expose it to Impala with `REFRESH ` – Samson Scharfrichter May 24 '16 at 08:28

4 Answers4

5

The following works in Impala:

split_part(string source, string delimiter, bigint n)

You can find the documentation here:

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_string_functions.html

Timothée HENRY
  • 14,294
  • 21
  • 96
  • 136
  • 3
    This solution would only work for one token at a time. It doesn't cleanly flatten the values into individual rows without an explicit call to each index position. Technically this could be a workaround, but will be quite verbose and messy. – eperks Apr 28 '20 at 22:59
1

You have to do this with Hive SQL.

So lets create the impala table

-- imapla sql
CREATE TABLE IF NOT EXISTS tmp.my_example (
  `Owner` VARCHAR,
  Pets VARCHAR
);

INSERT INTO tmp.my_example (`Owner`, `Pets`) VALUES
('Jack', 'dog, cat, crocodile'),
('Mary', 'bear, pig');

Then use hive sql to split the column to rows:

-- hive sql
select f.owner as owner, t1.pos as pos, t1.val AS pet
from tmp.my_example f
lateral view posexplode(split(pets,', ')) t1 

If you save your results as a new table don't forget to run refresh new_table in impala so that your new table can be used

P.S. Hive is incredible slow for this small table but performs reasonable for real data

fc9.30
  • 2,293
  • 20
  • 19
  • This is the solution i think. Fallow this Link https://www.edureka.co/community/30472/hadoop-hive-how-to-split-a-single-row-into-multiple-rows – Rajitha Fernando Mar 18 '21 at 17:44
1

Try this one in Impala:

    select t,split_part(t,',',RT) cc
    from (
    select '12,33,4' t
    union all 
    select '12,23'   t
    ) T1
    inner join(
    select row_number() over(partition by 1 order by 1 ) RT
    from test.any_table_name limit 10) T2
    where T2.RT<=length(t) - length(replace(t,',','')) +1
    order by t,RT

enter image description here

Lele Han
  • 31
  • 3
0

I want to improve a bit the answer from @Lele Han so it's clearer how to adapt it to your use case:

SELECT Owner
    , split_part(Pets, '|', PetPos) AS Pet
FROM (
    SELECT Owner, Pets
        , length(Pets) - length(replace(Pets, ',', '')) + 1 AS PetNumber
    FROM your_table
) t1
INNER JOIN (
    SELECT row_number() over(partition by 1 order by 1) PetPos
    FROM your_table
    LIMIT 100  -- a reasonable limit of number of pets
) t2
WHERE t2.PetPos <= t1.PetNumber
ronkov
  • 1,263
  • 9
  • 14