Read multiple CSVs with different headers into one single dataframe

Question

I have a few CSV files where some files might have some matching columns and some have altogether different columns. For Example file 1 has the following columns:

['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt', 'url']

and file2 has the following columns :

['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']

I want to create a dataframe that will have all these columns. I wrote the following code hoping that mapping between pre-defined schema and CSV file headers will automatically take place, but it did not work out.

sch=StructType([StructField('circuitId',StringType(),True),
StructField('year',StringType(),True),
StructField('name',StringType(),True),
StructField('alt',StringType(),True),
StructField('url',StringType(),True),
StructField('round',StringType(),True),
StructField('lng',StringType(),True),
StructField('date',StringType(),True),
StructField('circuitRef',StringType(),True),
StructField('raceId',StringType(),True),
StructField('lat',StringType(),True),
StructField('location',StringType(),True),
StructField('country',StringType(),True),
StructField('time',StringType(),True)
])

df=spark.read \
        .option('header','true') \
        .schema(sch) \
        .csv('/FileStore/Udemy/Formula_One_Raw/*.csv')

with this code I am getting the following output :

Did you try creating two dataframes and then join them on circuitId column. — ozlemg, Nov 24 '22 at 06:34
Could you add a table of your desired output with some sample data in it? — ozlemg, Nov 24 '22 at 09:45

score 0 · Answer 1 · answered Nov 25 '22 at 18:32

The CSV format uses positions, not names, of the columns to resolve the schema, so if you have a file with 8 columns, like:

['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']

and the first 8 fields of schema you are trying to apply are:

StructField('circuitId',StringType(),True),
StructField('year',StringType(),True),
StructField('name',StringType(),True),
StructField('alt',StringType(),True),
StructField('url',StringType(),True),
StructField('round',StringType(),True),
StructField('lng',StringType(),True),
StructField('date',StringType(),True)

then raceId will be inferred as circuitId, round as name and so on. You can do the following thing to resolve this:

Create separate schema for every distinct file, taking into account what columns are there, and add the ones that are not part of this particular schema at the end - this way they should be included in your DataFrame, and filled with NULL.

So for the example CSV schema at the top, you can declare the schema like so:

StructField('raceId',StringType(),True),
StructField('year',StringType(),True),
StructField('round',StringType(),True),
StructField('circuitId',StringType(),True),
StructField('name',StringType(),True),
StructField('date',StringType(),True),
StructField('time',StringType(),True),
StructField('url',StringType(),True),
StructField('alt',StringType(),True),
StructField('lng',StringType(),True),
StructField('circuitRef',StringType(),True),
StructField('lat',StringType(),True),
StructField('location',StringType(),True),
StructField('country',StringType(),True)

And your DataFrame should look like so:

+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|raceId|year|round|circuitId|                name|      date|    time|                 url| alt| lng|circuitRef| lat|location|country|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|     1|2009|    1|        1|Australian Grand ...|2009-03-29|06:00:00|http://en.wikiped...|null|null|      null|null|    null|   null|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+

After all your data has been read to DataFrames, you will be able to union them using column names, if needed.

Read multiple CSVs with different headers into one single dataframe

1 Answers1