
I'm sending JSON data from Apache Spark / Databricks to an API. The API is expecting the data in the following JSON format:

Sample:
{
  "CtcID": 1,
  "LastName": "sample string 2",
  "CpyID": 3,
  "HonTitle": "sample string 4",
  "PositionCode": 1,
  "PositionFreeText": "sample string 6",
  "CreateDate": "2021-04-21T08:50:56.8602571+01:00",
  "ModifDate": "2021-04-21T08:50:56.8602571+01:00",
  "ModifiedBy": 1,
  "SourceID": "sample string 9",
  "OriginID": "sample string 10",
  "DoNotExport": true,
  "ParentEmailAddress": "sample string 13",
  "SupInfo": [
    {
      "FieldName": "sample string 1",
      "DATA_TYPE": "sample string 2",
      "IS_NULLABLE": "sample string 3",
      "FieldContent": "sample string 4"
    },
    {
      "FieldName": "sample string 1",
      "DATA_TYPE": "sample string 2",
      "IS_NULLABLE": "sample string 3",
      "FieldContent": "sample string 4"
    }
  ]
}

However, the data I'm actually sending is in the following format (one JSON object per line):

{"Last_name":"Finnigan","First_name":"Michael","Email":"MichaelF@email.com"}
{"Last_name":"Phillips","First_name":"Austin","Email":"PhillipsA@email.com"}
{"Last_name":"Collins","First_name":"Colin","Email":"ColinCollins@email.com"}
{"Last_name":"Finnigan","First_name":"Judy","Email":"Judy@email.com"}
{"Last_name":"Jones","First_name":"Julie","Email":"Julie@email.com"}
{"Last_name":"Smith","First_name":"Barry","Email":"Barry@email.com"}
{"Last_name":"Kane","First_name":"Harry","Email":"Harry@email.com"}
{"Last_name":"Smith","First_name":"John","Email":"John@email.com"}
{"Last_name":"Colins","First_name":"Ruby","Email":"RubySmith@email.com"}
{"Last_name":"Tests","First_name":"Smoke","Email":"a.n.other@pret.com"}

The code in Apache Spark is as follows:

import json
import requests

url = 'https://enimuozygj4jqx.m.pipedream.net'
files = spark.read.json("abfss://azurestorageaccount.dfs.core.windows.net/PostContact.json")

r = requests.post(url, data=json.dumps(files))
print(r.status_code)

When I execute the code I get the following error:

TypeError: Object of type DataFrame is not JSON serializable

Patterson

1 Answer


A DataFrame is a collection of Row objects, and you can't call json.dumps on it directly. You can do something like this:

import json

from pyspark.sql.functions import struct, to_json

files_df = spark.read.json("...")
# Convert every row to a JSON string, then collect the strings to the driver
rows = files_df.select(to_json(struct('*')).alias("json")).collect()
# Parse each JSON string back into a Python dict
files = [json.loads(row[0]) for row in rows]
r = requests.post(url, data=json.dumps(files))

This code converts every row of the DataFrame into a struct (using the `struct` function), which behaves like a Python dict, and then turns that struct into a JSON string via `to_json`. After collecting the strings to the driver, each one is parsed back into a Python object, so `files` ends up as a list of dicts that `json.dumps` can serialize.
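To make the conversion concrete, here is a minimal, self-contained sketch using a couple of the sample records from the question; the in-memory rows and the SparkSession bootstrapping are only there so the snippet runs on its own and are not part of the original answer:

import json

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import struct, to_json

spark = SparkSession.builder.getOrCreate()

# Small in-memory stand-in for the JSON file read from ADLS
sample = [
    Row(Last_name="Finnigan", First_name="Michael", Email="MichaelF@email.com"),
    Row(Last_name="Phillips", First_name="Austin", Email="PhillipsA@email.com"),
]
files_df = spark.createDataFrame(sample)

# Each row becomes one JSON string; collect them and parse back into dicts
rows = files_df.select(to_json(struct('*')).alias("json")).collect()
files = [json.loads(row[0]) for row in rows]

# json.dumps(files) is now a single JSON array the API can receive, e.g.
# [{"Last_name": "Finnigan", "First_name": "Michael", "Email": "MichaelF@email.com"}, ...]
print(json.dumps(files))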

Alex Ott
  • thanks. I'm going to try your suggestion. – Patterson Apr 21 '21 at 16:20
  • this worked absolutely amazingly. Absolutely brilliant. Thank you so much – Patterson Apr 21 '21 at 16:29
  • Hi @alex, can I just ask for some additional help with this? We need to submit a token each time with your code to authenticate to the API server. Can you advise on where in the code we could incorporate the token? – Patterson Apr 21 '21 at 17:10
  • @Patterson Follow this [link](https://stackoverflow.com/questions/29931671/making-an-api-call-in-python-with-an-api-that-requires-a-bearer-token) for working with tokens – Yayati Sule Apr 21 '21 at 17:16
  • just follow the `requests` package documentation (see the sketch below these comments). You can keep the token in the Databricks secrets for additional security – Alex Ott Apr 21 '21 at 17:49
  • @YayatiSule, thanks for the link. I will investigate and let you know if the suggestion provided helped. Thanks – Patterson Apr 22 '21 at 07:44
  • @YayatiSule, the link you provided solved the issue with authenticating to API server. Thanks – Patterson Apr 22 '21 at 13:29
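
Following up on the token question in the comments: a minimal sketch of one way to pass it, assuming the API accepts a Bearer token in the Authorization header and that the token is stored in a hypothetical Databricks secret scope named api-scope under the key api-token (both names are placeholders, not from the original thread):

import json
import requests

url = 'https://enimuozygj4jqx.m.pipedream.net'

# dbutils is available by default in Databricks notebooks;
# the scope/key names below are placeholders for wherever the token is stored
token = dbutils.secrets.get(scope="api-scope", key="api-token")

headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}

# `files` is the list of dicts built in the answer above
r = requests.post(url, headers=headers, data=json.dumps(files))
print(r.status_code)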