Below are my schema and the code I use to read from partitions in HDFS.
An example of a partition could be this path: /home/maria_dev/data/key=key/date=19 jan
(and of course inside this folder there's a CSV file that contains cnt).
So the data is partitioned by the key and date columns.
When I read it as shown below, the columns are not read properly: cnt gets read into date and vice versa. How can I resolve this?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

private val tweetSchema = new StructType(Array(
  StructField("date", StringType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("cnt", IntegerType, nullable = true)
))
// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
  spark.read
    .schema(tweetSchema)
    .format(format)
    .option("basePath", basePath)
    .load(path)
}
I tried changing their order in the schema from (date, key, cnt) to (cnt, key, date), but it does not help.
My problem is that when I call union, it appends the two DataFrames:
- df1: {(key: 1, date: 2)}
- df2: {(date: 3, key: 4)}
into the final DataFrame like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns are messed up.
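As far as I know, union resolves columns by position rather than by name, which matches what I am seeing. Here is a minimal sketch (with made-up in-memory values, not my real data) that reproduces the mix-up:

import org.apache.spark.sql.SparkSession

// Hypothetical repro: union matches columns positionally, so df2's "date"
// values land under df1's "key" column and vice versa.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("1", "2")).toDF("key", "date") // columns: key, date
val df2 = Seq(("3", "4")).toDF("date", "key") // columns: date, key

df1.union(df2).show()
// +---+----+
// |key|date|
// +---+----+
// |  1|   2|
// |  3|   4|   <- "3" was df2's date value, "4" was its key value
// +---+----+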