
I have data like the below in Amazon Athena.

DATA

ID , CODE 
a  , 123
b  , 345
a  , 123
a  , 345
c  , 246
d  , 678

Grouped Data

I have grouped the above data by ID and CODE to get the frequency/count of each combination, as below.

ID , CODE ,FREQ
a  , 123  , 2
a  , 345  , 1
b  , 345  , 1
c  , 246  , 1
d  , 678  , 1

Required Transformation

I am trying to pivot the values in the CODE column into a header row and create a matrix/data frame like the one below.

ID , 123 ,345 ,246, 678
a  ,   2 ,  1 , 0 , 0
b  ,   0 ,  1 , 0 , 0
c  ,   0 ,  0 , 1 , 0
d  ,   0 ,  0 , 0 , 1

On a smaller data set I have successfully used R's `table` command and it works okay. But now I have 27M unique IDs and 300 unique CODE values across 220M rows, so my ultimate matrix will have 27M rows x 350 columns.

My questions:

  1. Is there a query I can write in Athena to achieve this result?
  2. Can I use R? The `table` command won't handle it; I would need to find a library to parallelize the data and then use the reshape2 package, and even then I am not sure how to go about it.
  3. Is Spark a better solution? If so, how do I do it? I have already set up Spark on an EC2 instance and downloaded the grouped data from S3 to the EC2 tmp folder as a CSV; it is an 8GB data file. (A rough sketch of what I have in mind is below this list.)
  4. Should I operate on the original data set or on the grouped data set?
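
To make question 3 concrete, here is a minimal pyspark sketch of the kind of pivot I have in mind (DataFrame `groupBy`/`pivot`). It is untested at this scale; the file path is a placeholder and the column names ID, CODE and FREQ come from the grouped data above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pivot-codes").getOrCreate()

    # Grouped data downloaded from S3 (placeholder path)
    grouped = spark.read.csv("/tmp/grouped_data.csv", header=True, inferSchema=True)

    # Pivot distinct CODE values into columns, summing FREQ for each ID/CODE pair.
    # Combinations that never occur come back as null, so fill them with 0
    # to match the target matrix above.
    wide = (
        grouped
        .groupBy("ID")
        .pivot("CODE")   # optionally pass the explicit list of codes to skip the extra distinct scan
        .sum("FREQ")
        .fillna(0)
    )

    wide.show()

    # Starting from the original (ungrouped) ID/CODE rows instead, the grouping
    # step can be folded into the pivot itself:
    # wide = original.groupBy("ID").pivot("CODE").count().fillna(0)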

Kindly give me pointers. I am a newbie with all of these technologies and am trying to figure this out.

  • What you want is to reshape the data from long to wide. Have you looked at reshape2 package? – Merik Aug 16 '18 at 19:23
  • I see pyspark tagged. You can use `.pivot` to transpose up the codes to columns. Read about it. – samkart Aug 16 '18 at 19:29
  • @Merik thanks, I have tried reshape2, but on a local machine with 16GB RAM it failed after 1M rows. I can try the same on an EC2 instance; I just finished installing R. – Pavan Aug 16 '18 at 19:48
  • @samkart will look into `.pivot` in pyspark. Last night I finished setting up pyspark on an EC2 instance. – Pavan Aug 16 '18 at 19:49
  • The only other idea I have is to use `dcast()` from `data.table`, as described in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html – Merik Aug 17 '18 at 01:37
  • @Merik do you know how to parallelize `dcast()`? It is failing after 5M rows. I get: "Opened 4.347GB (4667247189 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available." – Pavan Aug 17 '18 at 17:00
  • I finally accomplished this by writing a hardcoded query in Athena to pivot, giving exactly the names of the columns I want. I used this link: https://community.modeanalytics.com/sql/tutorial/sql-pivot-table/. Thanks for the pointers. – Pavan Aug 17 '18 at 18:34
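
For completeness, here is a rough sketch of the kind of hard-coded pivot query described in the last comment, following the SUM(CASE ...) pattern from the linked tutorial. The table name grouped_data is a placeholder, and in practice there would be one SUM(CASE ...) line per distinct CODE value (about 300) rather than the four shown.

    -- Athena / Presto: one hard-coded output column per CODE value
    SELECT
      ID,
      SUM(CASE WHEN CODE = 123 THEN FREQ ELSE 0 END) AS "123",
      SUM(CASE WHEN CODE = 345 THEN FREQ ELSE 0 END) AS "345",
      SUM(CASE WHEN CODE = 246 THEN FREQ ELSE 0 END) AS "246",
      SUM(CASE WHEN CODE = 678 THEN FREQ ELSE 0 END) AS "678"
    FROM grouped_data
    GROUP BY ID
    ORDER BY ID;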

0 Answers