4

I have a big data problem in Hive (SQL).

SELECT genre, COUNT(*) AS unique_count
FROM table_name
GROUP BY genre

which gives result like:

genre           |   unique_count
----------------------------------
Romance,Crime,Drama,Law | 1560
Crime,Drama,Law,Romance | 895
Law,Romance,Crime,Drama | 942
Adventure,Action        | 3250
Action,Adventure        | 910

What I want is to sort the elements in genre ASC|DESC and get results like

genre           |   unique_count
----------------------------------
Crime,Drama,Law,Romance | 3397
Action,Adventure        | 4160

I could do this in Python but I have over 200 Million rows of data. I'm not aware of any reasonable way I can move that data. So how can I achieve this?

Afloz
  • 3,625
  • 3
  • 25
  • 31
  • Fix your data structure. Storing lists of things in comma-delimited strings just causes problems. If you didn't know this already, you have now learned why this is a bad idea. – Gordon Linoff Feb 10 '17 at 20:17
  • 4
    Gordon, I actually do know that but when join a company with existing data, there's little you can do than to massage the messy data. right? – Afloz Feb 10 '17 at 20:24

1 Answers1

10
select      concat_ws(',',sort_array(split(genre,','))) as genre
           ,count(*)                                    as unique_count

from        table_name

group by    concat_ws(',',sort_array(split(genre,',')))
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88