
I'm trying to add comments to the fields of a schema (a schema with data definitions). My implementation is below.

I tried both StructType.add() (commented out in the code) and StructType([StructField("field", dtype, boolean, metadata)]).

Both give the error below. I'm not sure this approach works at all. Can someone help me here? I'm new to Spark.

I'm looking for output (schema with data definitions) like:

df.printSchema()

root
 |-- firstname: string (nullable = true) comments:val1
 |-- middlename: string (nullable = true) comments:val2
 |-- lastname: string (nullable = true) comments:val3
 |-- id: string (nullable = true) comments:val4
 |-- gender: string (nullable = true) comments:val5
 |-- salary: integer (nullable = true) comments:val6

Error:

IllegalArgumentException: Failed to convert the JSON string '{"metadata":"val1","name":"firstname","nullable":true,"type":"string"}' to a field.

The code I'm using to try to add comments to the fields:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ \
    StructField("firstname",StringType(),True,'val1'), \
    StructField("middlename",StringType(),True,'val2'), \
    StructField("lastname",StringType(),True,'val3'), \
    StructField("id", StringType(), True,'val4'), \
    StructField("gender", StringType(), True,'val5'), \
    StructField("salary", IntegerType(), True,'val6') \
  ])


# schema = StructType().add("firstname",StringType(),True,'val1').add("middlename",StringType(),True,'val2') \
#     .add("lastname",StringType(),True,'val3').add("id", StringType(), True,'val4').add("gender", StringType(), True,'val5').add("salary", IntegerType(), True,'val6')
          
         
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)
Sri

1 Answer


StructField's metadata parameter expects a dictionary, not a string. It would look something like this:

StructField("firstname", StringType(), True, {"comment":"val1"})
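To see why the string fails, it may help to compare the field JSON from the error message with the shape a dictionary produces. The sketch below uses only the standard `json` module and does not call Spark. The helper `metadata_is_object` is a hypothetical name for illustration, and the "good" JSON is my assumption of what the corrected field serializes to.

```python
import json

# Field JSON copied from the error message in the question:
# "metadata" is a plain string here.
bad_field = '{"metadata":"val1","name":"firstname","nullable":true,"type":"string"}'

# The same field with "metadata" as a JSON object, which is what a Python
# dict such as {"comment": "val1"} serializes to (assumed shape).
good_field = json.dumps(
    {"metadata": {"comment": "val1"}, "name": "firstname", "nullable": True, "type": "string"}
)


def metadata_is_object(field_json: str) -> bool:
    """Hypothetical helper: True when the field's metadata is a JSON object."""
    return isinstance(json.loads(field_json).get("metadata"), dict)


print(metadata_is_object(bad_field))   # False
print(metadata_is_object(good_field))  # True
```

The same change applies to the commented-out StructType.add() variant in the question: pass a dictionary as the fourth argument instead of a string.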
AdibP
  • As recommended, I have updated it to `StructField("firstname",StringType(),True,{"comment":"val1"})` and applied the same to the other columns. I'm still seeing the same output when I do df.printSchema(): "root |-- firstname: string (nullable = true)" – Sri Aug 19 '21 at 05:40
  • `.printSchema()` doesn't show the metadata of the schema. As an alternative, you can use `df.schema.jsonValue()` to show the schema and its metadata in JSON format. – AdibP Aug 19 '21 at 05:52
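If the goal is output shaped like the tree in the question, one option is to format the dictionary returned by `df.schema.jsonValue()` yourself. The following is a minimal sketch, not a Spark feature: `format_schema_with_comments` is a hypothetical helper, the `sample` dictionary is hand-written to mimic the `jsonValue()` structure for two of the columns, and only flat (non-nested) schemas are handled.

```python
def format_schema_with_comments(schema_json: dict, key: str = "comment") -> str:
    """Render a flat schema dict as a printSchema-like tree with comments appended."""
    lines = ["root"]
    for field in schema_json["fields"]:
        nullable = str(field["nullable"]).lower()  # True -> "true"
        line = f" |-- {field['name']}: {field['type']} (nullable = {nullable})"
        comment = field.get("metadata", {}).get(key)
        if comment is not None:
            line += f" comments:{comment}"
        lines.append(line)
    return "\n".join(lines)


# Hand-written stand-in for df.schema.jsonValue() (assumed structure).
sample = {
    "type": "struct",
    "fields": [
        {"name": "firstname", "type": "string", "nullable": True,
         "metadata": {"comment": "val1"}},
        {"name": "salary", "type": "integer", "nullable": True, "metadata": {}},
    ],
}

print(format_schema_with_comments(sample))
```

With a real DataFrame, the call would be `print(format_schema_with_comments(df.schema.jsonValue()))`. Fields without the chosen metadata key are printed without a comment suffix.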