
I have data like the below in Amazon Athena.

DATA

ID , CODE 
a  , 123
b  , 345
a  , 123
a  , 345
c  , 246
d  , 678

Grouped Data

I have grouped the above data by ID and CODE to get the frequency/count of each combination, as below.

ID , CODE ,FREQ
a  , 123  , 2
a  , 345  , 1
b  , 345  , 1
c  , 246  , 1
d  , 678  , 1

Required Transformation

I am trying to pivot the values in the CODE column into a header row and create a matrix/data frame like the one below.

ID , 123 ,345 ,246, 678
a  ,   2 ,  1 , 0 , 0
b  ,   0 ,  1 , 0 , 0
c  ,   0 ,  0 , 1 , 0
d  ,   0 ,  0 , 0 , 1

On a smaller data set I have successfully used R's `table` command and it works okay. But now I have 27M unique IDs and 300 unique CODE values across 220M rows, so my ultimate matrix will have 27M rows x 350 columns.

My questions:

  1. Is there a query I can write in Athena to achieve this result?
  2. Can I use R? The `table` command won't handle it; I would need to find a library to parallelize the data and then use the reshape2 package, and even then I am not sure how to go about it.
  3. Is Spark a better solution? If so, how do I do it? I have already set up Spark on an EC2 instance and downloaded the grouped data from S3 to the EC2 tmp folder as a CSV; it is an 8GB data file. (A rough sketch of what I have in mind is below this list.)
  4. Should I operate on the original data set or on the grouped data set?
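
To make question 3 concrete, here is a minimal pyspark sketch of the kind of pivot I have in mind (DataFrame `groupBy`/`pivot`). It is untested at this scale; the file path is a placeholder and the column names ID, CODE and FREQ come from the grouped data above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pivot-codes").getOrCreate()

    # Grouped data downloaded from S3 (placeholder path)
    grouped = spark.read.csv("/tmp/grouped_data.csv", header=True, inferSchema=True)

    # Pivot distinct CODE values into columns, summing FREQ for each ID/CODE pair.
    # Combinations that never occur come back as null, so fill them with 0
    # to match the target matrix above.
    wide = (
        grouped
        .groupBy("ID")
        .pivot("CODE")   # optionally pass the explicit list of codes to skip the extra distinct scan
        .sum("FREQ")
        .fillna(0)
    )

    wide.show()

    # Starting from the original (ungrouped) ID/CODE rows instead, the grouping
    # step can be folded into the pivot itself:
    # wide = original.groupBy("ID").pivot("CODE").count().fillna(0)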

Kindly give me pointers. I am a newbie with all of these technologies and am trying to figure this out.

  • What you want is to reshape the data from long to wide. Have you looked at reshape2 package? – Merik Aug 16 '18 at 19:23
  • I see pyspark tagged. You can use `.pivot` to transpose up the codes to columns. Read about it. – samkart Aug 16 '18 at 19:29
  • @Merik thanks, I have tried reshape2, but on a local machine with 16GB RAM it failed after 1M rows. I can try the same on an EC2 instance; I just finished installing R. – Pavan Aug 16 '18 at 19:48
  • @samkart will look into `.pivot` in pyspark. Last night I finished setting up pyspark on an EC2 instance. – Pavan Aug 16 '18 at 19:49
  • The only other idea I have is to use `dcast()` from `data.table`, as described in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html – Merik Aug 17 '18 at 01:37
  • @Merik do you know how to parallelize `dcast()`? It is failing after 5M rows. I get: "Opened 4.347GB (4667247189 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available." – Pavan Aug 17 '18 at 17:00
  • I finally accomplished this by writing a hardcoded query in Athena to pivot, giving exactly the names of the columns I want. I used this link: https://community.modeanalytics.com/sql/tutorial/sql-pivot-table/. Thanks for the pointers. – Pavan Aug 17 '18 at 18:34
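
For completeness, here is a rough sketch of the kind of hard-coded pivot query described in the last comment, following the SUM(CASE ...) pattern from the linked tutorial. The table name grouped_data is a placeholder, and in practice there would be one SUM(CASE ...) line per distinct CODE value (about 300) rather than the four shown.

    -- Athena / Presto: one hard-coded output column per CODE value
    SELECT
      ID,
      SUM(CASE WHEN CODE = 123 THEN FREQ ELSE 0 END) AS "123",
      SUM(CASE WHEN CODE = 345 THEN FREQ ELSE 0 END) AS "345",
      SUM(CASE WHEN CODE = 246 THEN FREQ ELSE 0 END) AS "246",
      SUM(CASE WHEN CODE = 678 THEN FREQ ELSE 0 END) AS "678"
    FROM grouped_data
    GROUP BY ID
    ORDER BY ID;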

0 Answers