I have a few CSV files where some files might have some matching columns and some have altogether different columns. For Example file 1 has the following columns:
['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt', 'url']
and file2 has the following columns :
['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']
I want to create a dataframe that will have all these columns. I wrote the following code hoping that mapping between pre-defined schema and CSV file headers will automatically take place, but it did not work out.
sch=StructType([StructField('circuitId',StringType(),True),
StructField('year',StringType(),True),
StructField('name',StringType(),True),
StructField('alt',StringType(),True),
StructField('url',StringType(),True),
StructField('round',StringType(),True),
StructField('lng',StringType(),True),
StructField('date',StringType(),True),
StructField('circuitRef',StringType(),True),
StructField('raceId',StringType(),True),
StructField('lat',StringType(),True),
StructField('location',StringType(),True),
StructField('country',StringType(),True),
StructField('time',StringType(),True)
])
df=spark.read \
.option('header','true') \
.schema(sch) \
.csv('/FileStore/Udemy/Formula_One_Raw/*.csv')
with this code I am getting the following output :