I need some help iterating through the following JSON in PySpark and building a dataframe:

{
    "success": true,
    "result": {
        "0x00e01a648ff41346cdeb873182383333d2184dd1": {
            "id": 130,
            "name": "xn--mytherwallet-fvb.com",
            "url": "http://xn--mytherwallet-fvb.com",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "description": "Homoglyph",
            "addresses": [
                "0x00e01a648ff41346cdeb873182383333d2184dd1",
                "0x11e01a648ff41346cdeb873182383333d2184dd1"
            ],
            "reporter": "MyCrypto",
            "status": "Offline"
        },
        "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {
            "id": 1200,
            "name": "myetherwallet.in",
            "url": "http://myetherwallet.in",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"],
            "reporter": "MyCrypto",
            "ip": "159.8.210.35",
            "nameservers": [
                "ns2.eftydns.com",
                "ns1.eftydns.com"
            ],
            "status": "Active"
        }
    }
}

I need to build a dataframe which represents a list of addresses.
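
That is, one row per address, flattened across all the entries, something like:

+------------------------------------------+
|addresses                                 |
+------------------------------------------+
|0x00e01a648ff41346cdeb873182383333d2184dd1|
|0x11e01a648ff41346cdeb873182383333d2184dd1|
|0x858457daa7e087ad74cdeeceab8419079bc2ca03|
+------------------------------------------+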

2 Answers

I formatted your JSON into a single-line, Spark-readable format (by default, spark.read.json expects one JSON object per line, i.e. JSON Lines):

{"success": true, "result": {"0x00e01a648ff41346cdeb873182383333d2184dd1": {"id": 130, "name": "xn--mytherwallet-fvb.com", "url": "http://xn--mytherwallet-fvb.com", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "description": "Homoglyph", "addresses": ["0x00e01a648ff41346cdeb873182383333d2184dd1", "0x11e01a648ff41346cdeb873182383333d2184dd1"], "reporter": "MyCrypto", "status": "Offline"}, "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {"id": 1200, "name": "myetherwallet.in", "url": "http://myetherwallet.in", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"], "reporter": "MyCrypto", "ip": "159.8.210.35", "nameservers": ["ns2.eftydns.com", "ns1.eftydns.com"], "status": "Active"}}}

Read the JSON

val df = spark.read.json("/my_data.json")

df.printSchema()
df.show(false)

Output

root
 |-- result: struct (nullable = true)
 |    |-- 0x00e01a648ff41346cdeb873182383333d2184dd1: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |-- 0x858457daa7e087ad74cdeeceab8419079bc2ca03: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- ip: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- nameservers: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- success: boolean (nullable = true)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                     |success|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[[WrappedArray(0x00e01a648ff41346cdeb873182383333d2184dd1, 0x11e01a648ff41346cdeb873182383333d2184dd1),Phishing,ETH,Homoglyph,130,xn--mytherwallet-fvb.com,MyCrypto,Offline,MyEtherWallet,http://xn--mytherwallet-fvb.com],[WrappedArray(0x858457daa7e087ad74cdeeceab8419079bc2ca03),Phishing,ETH,1200,159.8.210.35,myetherwallet.in,WrappedArray(ns2.eftydns.com, ns1.eftydns.com),MyCrypto,Active,MyEtherWallet,http://myetherwallet.in]]|true   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+

In PySpark you can do the following. There's no need to reformat your JSON; it's perfectly valid as it is, and you just need to pass .option('multiline', True) to the JSON reader.

df = spark.read.option('multiline', True).json('test.json')
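
To verify the parse, you can check the schema; it should match the nested struct shown in the other answer:

df.printSchema()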

To get the addresses:

import pyspark.sql.functions as F

df2 = df.select('result.*')  # one struct column per address key
df3 = df2.select(
    F.explode(
        F.concat(  # concatenating arrays requires Spark 2.4+
            *[F.col(f'{col}.addresses') for col in df2.columns]
        )
    ).alias('addresses')  # one row per address
)

df3.show(truncate=False)
+------------------------------------------+
|addresses                                 |
+------------------------------------------+
|0x00e01a648ff41346cdeb873182383333d2184dd1|
|0x11e01a648ff41346cdeb873182383333d2184dd1|
|0x858457daa7e087ad74cdeeceab8419079bc2ca03|
+------------------------------------------+
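
If you need more than just the addresses, the same pattern generalizes: wrap the fields you want into one struct per entry, then explode twice. A minimal, untested sketch that keeps each entry's name alongside its addresses (the field choice here is just an example):

import pyspark.sql.functions as F

df2 = df.select('result.*')
# one struct per entry, carrying only the fields we want to keep
entries = df2.select(F.array(*[
    F.struct(
        F.col(f'{c}.name').alias('name'),
        F.col(f'{c}.addresses').alias('addresses'),
    ) for c in df2.columns
]).alias('entries'))
# explode the entries, then explode each entry's addresses
df4 = (entries
       .select(F.explode('entries').alias('e'))
       .select('e.name', F.explode('e.addresses').alias('address')))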
  • Thanks for the comment... I am close, but it still doesn't work; I seem to be getting a syntax error: >>> df3 = df2.select(F.explode(F.concat(*[F.col(f'{col}.addresses') for col in df2.columns])).alias('addresses')) File "", line 1 df3 = df2.select(F.explode(F.concat(*[F.col(f'{col}.addresses') for col in df2.columns])).alias('addresses')) ^ SyntaxError: invalid syntax – Ropanb Dec 07 '20 at 00:01
  • @Ropanb are you using an old version of Python? If so, try replacing `f'{col}.addresses'` with `'%s.addresses' % (col)` – mck Dec 07 '20 at 07:50
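
For anyone on a Python version older than 3.6 (no f-strings), here is the same snippet with the f-string swapped for %-formatting, per the comment above:

import pyspark.sql.functions as F

df2 = df.select('result.*')
df3 = df2.select(
    F.explode(
        F.concat(
            # '%s.addresses' % col is the pre-f-string equivalent of f'{col}.addresses'
            *[F.col('%s.addresses' % col) for col in df2.columns]
        )
    ).alias('addresses')
)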