16

My PySpark data frame has the following schema:

schema = spark_df.printSchema()
root
 |-- field_1: double (nullable = true)
 |-- field_2: double (nullable = true)
 |-- field_3: double (nullable = true)
 |-- field_4: double (nullable = true)
 |-- field_5: double (nullable = true)
 |-- field_6: double (nullable = true)

I would like to add one more StructField to the schema, so the new schema would look like:

root
 |-- field_0: string (nullable = true)
 |-- field_1: double (nullable = true)
 |-- field_2: double (nullable = true)
 |-- field_3: double (nullable = true)
 |-- field_4: double (nullable = true)
 |-- field_5: double (nullable = true)
 |-- field_6: double (nullable = true)

I know I can manually create a new_schema like below:

new_schema = StructType([StructField("field_0", StringType(), True),
                            :
                         StructField("field_6", IntegerType(), True)])

This works for a small number of fields, but it isn't practical to write out by hand when I have hundreds of fields. So I am wondering: is there a more elegant way to add a new field to the beginning of the schema? Thanks!

Edamame

2 Answers

27

You can copy the existing fields and prepend:

from pyspark.sql.types import StructField, StringType, StructType

# new field(s) to put in front of the existing ones
to_prepend = [StructField("field_0", StringType(), True)]

StructType(to_prepend + df.schema.fields)
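
For context, here is a minimal, self-contained sketch of this in action. The SparkSession setup, the sample df, and the "default" placeholder value are assumptions for illustration only; the new StructType is just a description of the layout, so you still have to supply values for the prepended column when you build a DataFrame against it:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# hypothetical frame with the original (double) columns
df = spark.createDataFrame([(1.0, 2.0)], ["field_1", "field_2"])

# prepend a string field to the existing schema
to_prepend = [StructField("field_0", StringType(), True)]
new_schema = StructType(to_prepend + df.schema.fields)

# the schema alone carries no data; supply a value for the new column
new_df = spark.createDataFrame(df.rdd.map(lambda row: ("default",) + tuple(row)), new_schema)
new_df.printSchema()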
zero323
  • I got the following error: ----> 5 StructType(to_prepend + schema.fields) AttributeError: 'NoneType' object has no attribute 'fields' – Edamame Sep 18 '16 at 19:52
  • I mean, if schema is really a schema. You executed `spark_df.printSchema()`, which doesn't return a useful value. – zero323 Sep 18 '16 at 20:33
  • in case you may need to prepend more than one field, you could also use the following: https://stackoverflow.com/questions/42959493/creating-spark-schema-for-glove-word-vector-files/50650639#50650639 – Quetzalcoatl Jun 01 '18 at 20:35
0

The question seems to ask how to prepend a field to a schema, but note that if you just want to add a field, this can be achieved with the StructType.add(field) method. For example:

from pyspark.sql.types import StructType, StructField, StringType

# define some schema
schema = StructType([
    StructField('Field 1', StringType(), True),
    StructField('Field 2', StringType(), True)
])

# add a field (add() mutates the schema in place and returns it)
schema.add('Field 3', StringType(), True)

# create an empty dataframe from the schema and test
df = spark.createDataFrame(data=[], schema=schema)
df.printSchema()
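
Since the original question mentions hundreds of fields, it may also help that add() returns the StructType itself, so it can be called in a loop or chained when the field names follow a pattern. A small sketch, assuming hypothetical field_{i} names rather than anything from the original post:

from pyspark.sql.types import StructType, StringType, DoubleType

# start with the prepended string field, then add the numeric ones in a loop
schema = StructType()
schema.add('field_0', StringType(), True)
for i in range(1, 7):  # replace 7 with however many fields you have
    schema.add('field_{}'.format(i), DoubleType(), True)
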
Wietla