
I have two tables with the example schemas below. The keys of table A are nested in an array column (keyAs) in table B. I would like to join table A and table B on those keys to generate table C, where the matching rows from table A appear as a nested array of structs (valueAs), one per entry in B's keyAs list. How can I do this using PySpark? Thanks!

Table A

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyA: string (nullable = true)

Table B

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyB: string (nullable = true)
 |-- keyAs: array (nullable = true)
 |    |-- element: string (containsNull = true)

Table C

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyB: string (nullable = true)
 |-- keyAs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- valueAs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item1: string (nullable = true)
 |    |    |-- item2: long (nullable = true)
 |    |    |-- keyA: string (nullable = true)
paul62285

1 Answer


To join A and B you first need to explode B.keyAs, so that each element of the array becomes its own row:

from pyspark.sql.functions import explode

tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')

For creating the nested structure, you can group the joined rows back together by table B's columns and collect table A's columns into an array of structs with collect_list.
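Putting both steps together, a minimal sketch might look like this, assuming tableA and tableB are DataFrames with the schemas above (the intermediate names a_packed, valueA, and tableC are just for illustration). Table A's columns are packed into a single struct before the join so they don't clash with B's item1 and item2:

from pyspark.sql import functions as F

# Pack table A's columns into one struct column so they survive the
# join without colliding with table B's own item1/item2 columns.
a_packed = tableA.select('keyA', F.struct('item1', 'item2', 'keyA').alias('valueA'))

tableC = (
    tableB
    .withColumn('keyA', F.explode('keyAs'))  # one row per element of keyAs
    .join(a_packed, 'keyA', 'left')          # attach the matching A struct
    .groupBy('item1', 'item2', 'keyB', 'keyAs')
    .agg(F.collect_list('valueA').alias('valueAs'))  # re-nest matches per B row
)

Note that explode drops rows whose keyAs is null or empty; if those rows should survive with an empty valueAs, use explode_outer instead. collect_list skips nulls, so keyAs entries with no match in table A simply contribute nothing to valueAs.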

Mariusz