How can I produce the following nested json output (conversation
= parent, message
= child) using Databricks with pySpark or SQL
Desired output:
{
"conversation_id": 1,
"conversation_detail": "between mike and angie",
"messages": [
{
"message_id": 1,
"value": "hi mike"
},
{
"message_id": 2,
"value": "hi angie"
}
]
},
{
"conversation_id": 2,
"conversation_detail": "between tim and lisa",
"messages": [
{
"message_id": 1,
"value": "hi tim"
},
{
"message_id": 2,
"value": "hi lisa"
}
]
}
Basic output:
+---------------+----------------------+----------+---------------+--------+
|conversation_id|conversation_detail |message_id|conversation_id|value |
+---------------+----------------------+----------+---------------+--------+
|1 |between mike and angie|1 |1 |hi mike |
|1 |between mike and angie|2 |1 |hi angie|
|2 |between tim and lisa |3 |2 |hi tim |
|2 |between tim and lisa |4 |2 |hi lisa |
+---------------+----------------------+----------+---------------+--------+
Basic query:
SELECT *
FROM conversation c
JOIN messages m ON c.conversation_id = m.conversation_id
Setup:
CREATE OR REPLACE TABLE conversation
(
conversation_id INT,
conversation_detail STRING
);
CREATE OR REPLACE TABLE messages
(
message_id INT,
conversation_id INT,
value STRING
);
INSERT INTO conversation
VALUES
(1, 'between mike and angie'),
(2, 'between tim and lisa');
INSERT INTO messages
VALUES
(1, 1, 'hi mike'),
(2, 1, 'hi angie'),
(3, 2, 'hi tim'),
(4, 2, 'hi lisa');
What I've tried:
- Databricks
to_json
function
SQL Server's FOR JSON
function behaves like it want because it respects the parent/child relationship represented by the join.