Below are my schema and the code I use to read partitioned data from HDFS.

An example partition path is /home/maria_dev/data/key=key/date=19 jan; inside that folder there is a CSV file that contains cnt.

So the data is partitioned by the key and date columns.

When I read it as shown below, the columns are not read properly: cnt ends up in date and vice versa.

How can I resolve this?

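import org.apache.spark.sql.types._

// Schema as originally declared: partition columns first, then the CSV column.
private val tweetSchema = new StructType(Array(
  StructField("date", StringType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("cnt", IntegerType, nullable = true)
))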

// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
  spark.read
    .schema(tweetSchema)
    .format(format)
    // basePath tells Spark where partition discovery starts,
    // so key and date are still derived from the directory names.
    .option("basePath", basePath)
    .load(path)
}

I tried changing their order in the schema from (date, key, cnt) to (cnt, key, date) but it does not help.

My problem shows up when I call union on two DataFrames:

  • df1: {(key: 1, date: 2)}
  • df2: {(date: 3, key: 4)}

The final DataFrame looks like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns of the second DataFrame are mixed up.

pavel_orekhov

2 Answers

1

The schema should list the columns in the following order:

  • Columns present in the data files themselves; for CSV, in their natural left-to-right order.
  • Partition columns, in the same order as defined by the directory structure.

So in your case the correct order will be:

new StructType(Array(
  StructField("cnt", IntegerType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("date", StringType, nullable = true)
))
0

It turns out that everything was read correctly. The problem was union itself, which matches columns by position rather than by name.

So now, instead of df1.union(df2), I call df1.select("key", "date").union(df2.select("key", "date")), and it works.
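
For illustration, here is a minimal sketch of the same effect with two tiny in-memory DataFrames. The object name UnionSketch, the local SparkSession, and the helper rows are assumptions made for this example, not part of the original code. It also shows unionByName, which is available from Spark 2.3 onward and may be an alternative to reordering with select.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionSketch {
  // Assumption: a throwaway local session, used only for this example.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("union-sketch")
    .getOrCreate()
  import spark.implicits._

  // Same columns, declared in a different order.
  val df1: DataFrame = Seq((1, 2)).toDF("key", "date") // key = 1, date = 2
  val df2: DataFrame = Seq((3, 4)).toDF("date", "key") // date = 3, key = 4

  // union pairs columns by position, so df2's date value lands under key.
  val positional: DataFrame = df1.union(df2)

  // Fix from the answer: put both sides into the same column order first.
  val bySelect: DataFrame =
    df1.select("key", "date").union(df2.select("key", "date"))

  // Alternative (Spark 2.3+): let Spark pair the columns by name.
  val byName: DataFrame = df1.unionByName(df2)

  // Hypothetical helper: collect (key, date) pairs in result-column order.
  def rows(df: DataFrame): Seq[(Int, Int)] =
    df.collect().map(r => (r.getInt(0), r.getInt(1))).toSeq
}
```

With these inputs, positional contains a second row in which key holds 3 and date holds 4, while bySelect and byName both keep key = 4 and date = 3 for that row.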

pavel_orekhov