
Imagine the data for this question is in a nested JSON structure. I have flattened the JSON using explode() and loaded it into one DataFrame with the columns project, Task, Task-Evidence, Task-Remarks, Project-Evidence.

*Note: This DataFrame has 1 project with 2 tasks; the first task has 1 task-link and the second task also has 1 task-link. At the project level there are 3 project-links.

Result of DF

Expected Result

  • Your expected df isn't a valid table. The JSON you already have is perfect for what you want to achieve: you can create a DataFrame with a JSON column where each row is one project and use that in your code, instead of exploding and then trying to recreate the "JSONish" structure. – Equinox May 25 '23 at 07:40
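The comment's suggestion of keeping one JSON document per row, instead of exploding, can be sketched as follows. The file name `projects.json` and its layout (one JSON document per line, one project each) are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("projects")
  .master("local[*]")
  .getOrCreate()

// Let Spark infer the nested schema: tasks stay as an array of structs,
// so no re-assembly of the "JSONish" structure is needed later.
val projects = spark.read.json("projects.json")
projects.printSchema()

// Or keep the raw JSON string itself as a single column, one project per row:
val raw = spark.read.text("projects.json")
  .withColumnRenamed("value", "project_json")
```

For a single pretty-printed document rather than line-delimited JSON, `spark.read.option("multiLine", true).json(...)` would be needed instead.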

1 Answer


AFAIU, if you have flattened the JSON, then you just need to group Task, Task-Evidence, etc. by project, so you can group by project and use collect_set, something like this:

import org.apache.spark.sql.functions._

// collect_set gathers the distinct values of each column per project
val df2 = df.groupBy("project").agg(
  collect_set("Task").as("Tasks"),
  collect_set("Task-Evidence").as("Task-Evidences"),
  collect_set("Task-Remarks").as("Task-Remarks"),
  collect_set("Project-Evidence").as("Project-Evidences")
)
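One caveat with separate collect_set calls is that the pairing between a task and its evidence is lost, and duplicate values are dropped. If the pairing matters, collecting a struct per task keeps each task together with its own evidence and remarks (a sketch using the column names from the question):

```scala
import org.apache.spark.sql.functions._

val df3 = df.groupBy("project").agg(
  // one struct per task row, so Task / Task-Evidence / Task-Remarks stay paired
  collect_set(struct(col("Task"), col("Task-Evidence"), col("Task-Remarks"))).as("Tasks"),
  collect_set("Project-Evidence").as("Project-Evidences")
)
```

Swap collect_set for collect_list if duplicate values should be preserved rather than deduplicated.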
Islam Elbanna