Below are my schema and the code I use to read from partitions in HDFS.
An example of a partition could be this path: /home/maria_dev/data/key=key/date=19 jan
(and of course inside this folder there's a CSV file that contains cnt).
So the data is partitioned by the key and date columns.
When I read it as shown below, the columns are not read properly: cnt gets read into date and vice versa. How can I resolve this?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

private val tweetSchema = new StructType(Array(
  StructField("date", StringType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("cnt", IntegerType, nullable = true)
))
// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
  spark.read
    .schema(tweetSchema)
    .format(format)
    .option("basePath", basePath)
    .load(path)
}
I tried changing their order in the schema from (date, key, cnt) to (cnt, key, date), but it does not help.
My problem is that when I call union, it appends the two DataFrames:
- df1: {(key: 1, date: 2)}
- df2: {(date: 3, key: 4)}
into the final DataFrame like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns are messed up.
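As far as I know, union resolves columns by position rather than by name, which matches what I am seeing. Here is a minimal sketch (with made-up in-memory values, not my real data) that reproduces the mix-up:

import org.apache.spark.sql.SparkSession

// Hypothetical repro: union matches columns positionally, so df2's "date"
// values land under df1's "key" column and vice versa.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("1", "2")).toDF("key", "date") // columns: key, date
val df2 = Seq(("3", "4")).toDF("date", "key") // columns: date, key

df1.union(df2).show()
// +---+----+
// |key|date|
// +---+----+
// |  1|   2|
// |  3|   4|   <- "3" was df2's date value, "4" was its key value
// +---+----+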