0

I have a few CSV files where some files might have some matching columns and some have altogether different columns. For Example file 1 has the following columns:

['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt', 'url']

and file2 has the following columns :

['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']

I want to create a dataframe that will have all these columns. I wrote the following code hoping that mapping between pre-defined schema and CSV file headers will automatically take place, but it did not work out.

sch=StructType([StructField('circuitId',StringType(),True),
StructField('year',StringType(),True),
StructField('name',StringType(),True),
StructField('alt',StringType(),True),
StructField('url',StringType(),True),
StructField('round',StringType(),True),
StructField('lng',StringType(),True),
StructField('date',StringType(),True),
StructField('circuitRef',StringType(),True),
StructField('raceId',StringType(),True),
StructField('lat',StringType(),True),
StructField('location',StringType(),True),
StructField('country',StringType(),True),
StructField('time',StringType(),True)
])

df=spark.read \
        .option('header','true') \
        .schema(sch) \
        .csv('/FileStore/Udemy/Formula_One_Raw/*.csv')

with this code I am getting the following output :

enter image description here

Ankit Tyagi
  • 175
  • 2
  • 17

1 Answers1

0

The CSV format uses positions, not names, of the columns to resolve the schema, so if you have a file with 8 columns, like:

['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']

and the first 8 fields of schema you are trying to apply are:

StructField('circuitId',StringType(),True),
StructField('year',StringType(),True),
StructField('name',StringType(),True),
StructField('alt',StringType(),True),
StructField('url',StringType(),True),
StructField('round',StringType(),True),
StructField('lng',StringType(),True),
StructField('date',StringType(),True)

then raceId will be inferred as circuitId, round as name and so on. You can do the following thing to resolve this:

  • Create separate schema for every distinct file, taking into account what columns are there, and add the ones that are not part of this particular schema at the end - this way they should be included in your DataFrame, and filled with NULL.

So for the example CSV schema at the top, you can declare the schema like so:

StructField('raceId',StringType(),True),
StructField('year',StringType(),True),
StructField('round',StringType(),True),
StructField('circuitId',StringType(),True),
StructField('name',StringType(),True),
StructField('date',StringType(),True),
StructField('time',StringType(),True),
StructField('url',StringType(),True),
StructField('alt',StringType(),True),
StructField('lng',StringType(),True),
StructField('circuitRef',StringType(),True),
StructField('lat',StringType(),True),
StructField('location',StringType(),True),
StructField('country',StringType(),True)

And your DataFrame should look like so:

+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|raceId|year|round|circuitId|                name|      date|    time|                 url| alt| lng|circuitRef| lat|location|country|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|     1|2009|    1|        1|Australian Grand ...|2009-03-29|06:00:00|http://en.wikiped...|null|null|      null|null|    null|   null|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+

After all your data has been read to DataFrames, you will be able to union them using column names, if needed.

Bartosz Gajda
  • 984
  • 6
  • 14