
I have two tables with the example schemas below. The keys of table A are nested in an array column (keyAs) in table B. I would like to join table A and table B on those keys to generate table C, where the matching rows from table A appear as a nested array of structs (valueAs), one per entry in B's keyAs list. How can I do this using PySpark? Thanks!

Table A

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyA: string (nullable = true)

Table B

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyB: string (nullable = true)
 |-- keyAs: array (nullable = true)
 |    |-- element: string (containsNull = true)

Table C

root
 |-- item1: string (nullable = true)
 |-- item2: long (nullable = true)
 |-- keyB: string (nullable = true)
 |-- keyAs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- valueAs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item1: string (nullable = true)
 |    |    |-- item2: long (nullable = true)
 |    |    |-- keyA: string (nullable = true)
paul62285

1 Answer


To join A and B you first need to explode B.keyAs, so that each element of the array becomes its own row:

from pyspark.sql.functions import explode

tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')

For creating the nested structure, you can group the joined rows back together by table B's columns and collect table A's columns into an array of structs with collect_list.
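Putting both steps together, a minimal sketch might look like this, assuming tableA and tableB are DataFrames with the schemas above (the intermediate names a_packed, valueA, and tableC are just for illustration). Table A's columns are packed into a single struct before the join so they don't clash with B's item1 and item2:

from pyspark.sql import functions as F

# Pack table A's columns into one struct column so they survive the
# join without colliding with table B's own item1/item2 columns.
a_packed = tableA.select('keyA', F.struct('item1', 'item2', 'keyA').alias('valueA'))

tableC = (
    tableB
    .withColumn('keyA', F.explode('keyAs'))  # one row per element of keyAs
    .join(a_packed, 'keyA', 'left')          # attach the matching A struct
    .groupBy('item1', 'item2', 'keyB', 'keyAs')
    .agg(F.collect_list('valueA').alias('valueAs'))  # re-nest matches per B row
)

Note that explode drops rows whose keyAs is null or empty; if those rows should survive with an empty valueAs, use explode_outer instead. collect_list skips nulls, so keyAs entries with no match in table A simply contribute nothing to valueAs.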

Mariusz