
Imagine the data for this question is in a nested JSON structure. I have flattened the JSON using explode() and loaded it into one DataFrame with the columns project, Task, Task-Evidence, Task-Remarks, Project-Evidence.

*Note: This DataFrame has 1 project with 2 tasks; the first task has 1 task-link and the second task also has 1 task-link. At the project level there are 3 project-links.

Result of DF

Expected Result

  • Your expected df isn't a valid table. The JSON you already have is perfect for what you want to achieve: you can create a DataFrame with a JSON column where each row is one project and use that in your code, instead of exploding and then trying to recreate the "JSONish" structure. – Equinox May 25 '23 at 07:40
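The comment's suggestion of keeping one JSON document per row, instead of exploding, can be sketched as follows. The file name `projects.json` and its layout (one JSON document per line, one project each) are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("projects")
  .master("local[*]")
  .getOrCreate()

// Let Spark infer the nested schema: tasks stay as an array of structs,
// so no re-assembly of the "JSONish" structure is needed later.
val projects = spark.read.json("projects.json")
projects.printSchema()

// Or keep the raw JSON string itself as a single column, one project per row:
val raw = spark.read.text("projects.json")
  .withColumnRenamed("value", "project_json")
```

For a single pretty-printed document rather than line-delimited JSON, `spark.read.option("multiLine", true).json(...)` would be needed instead.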

1 Answer


AFAIU, if you have flattened the JSON, then you just need to group Task, Task-Evidence, etc. by project, so you can group by project and use collect_set, something like this:

import org.apache.spark.sql.functions._

// collect_set gathers the distinct values of each column per project
val df2 = df.groupBy("project").agg(
  collect_set("Task").as("Tasks"),
  collect_set("Task-Evidence").as("Task-Evidences"),
  collect_set("Task-Remarks").as("Task-Remarks"),
  collect_set("Project-Evidence").as("Project-Evidences")
)
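One caveat with separate collect_set calls is that the pairing between a task and its evidence is lost, and duplicate values are dropped. If the pairing matters, collecting a struct per task keeps each task together with its own evidence and remarks (a sketch using the column names from the question):

```scala
import org.apache.spark.sql.functions._

val df3 = df.groupBy("project").agg(
  // one struct per task row, so Task / Task-Evidence / Task-Remarks stay paired
  collect_set(struct(col("Task"), col("Task-Evidence"), col("Task-Remarks"))).as("Tasks"),
  collect_set("Project-Evidence").as("Project-Evidences")
)
```

Swap collect_set for collect_list if duplicate values should be preserved rather than deduplicated.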
Islam Elbanna