
This seems like a simple thing to do, but I am nevertheless stuck with this assignment.

I have a PySpark DataFrame (created by reading a JSON file). It has almost 1000 columns; each column name is the unique identifier of a JSON file, and the value in that column is the actual content of that file.

The dataframe now looks like this:
|json_file_1|json_file_2|json_file_3|json_file_4|
|:----------|:----------|:----------|:----------|
|json_content |json_content|json_content|json_content|

I want to convert it into something like this, where each json_file name is transposed into a value of a newly created column 'id':

|id|json_content|
|:-|:-----------|
|json_file_1|json_content|
|json_file_2|json_content|
|json_file_3|json_content|
|json_file_4|json_content|

Any suggestions on how to do this most effectively? I have been studying some cases where "melt" is suggested, but the examples shown were so different from my situation that I could not apply them easily.

Note: I could simply copy and paste the table into Excel and transpose it there, but I don't want to go that route.
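
For reference, a toy version of such a wide, single-row DataFrame (with hypothetical column values, built by hand here rather than read from the actual JSON files) would look roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row; each column name is a JSON file id, each value is that file's content.
df = spark.createDataFrame(
    [("content_1", "content_2", "content_3", "content_4")],
    ["json_file_1", "json_file_2", "json_file_3", "json_file_4"],
)
df.show()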

1 Answer


So instead of a pivot you want a melt-type operation. Unfortunately, as far as I know, there is no built-in melt function in PySpark. You can get around it with the following:

from pyspark.sql import functions as F

# stack() takes the number of output rows followed by 'name', value pairs;
# continue the pairs for all 1000 columns.
unpivotExpr = (
    "stack(1000, 'json_file_1', json_file_1, 'json_file_2', json_file_2, "
    "'json_file_3', json_file_3, ...) as (json_file, json_data)"
)

df_melted = df.select(F.expr(unpivotExpr))
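
With roughly 1000 columns, the stack() string probably should not be typed out by hand. Below is a minimal sketch that builds the expression from df.columns instead, assuming df is the wide single-row DataFrame from the question; it uses selectExpr rather than select(F.expr(...)), which may avoid the multi-alias AnalysisException reported in the comment below.

# Build the 'name', value pairs for every column of df.
pairs = ", ".join(f"'{c}', `{c}`" for c in df.columns)
unpivot_expr = f"stack({len(df.columns)}, {pairs}) as (id, json_content)"

# selectExpr parses the multi-column alias directly.
df_melted = df.selectExpr(unpivot_expr)
df_melted.show()

The aliases id and json_content match the target layout shown in the question.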
Lex D
  • Unfortunately it does not seem to work, as I get an AnalysisException: "The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 1 aliases but got json_file,json_data". When I try this using only the first five columns and change the variable from 1000 to 5, I get the same exception – Antonius Aug 09 '23 at 10:02