
I have a data set like this:

name  time val
---- ----- ---
fred 04:00 111
greg 03:00 123
fred 01:00 411
fred 05:00 921
fred 11:00 157
greg 12:00 333

And CSV files in some folder, one for each unique name in the data set:

fred.csv
greg.csv

The contents of fred.csv, for example, looks like this:

00:00 222
10:00 133

My goal is to efficiently merge the dataset into the CSVs, in sorted time order, so that fred.csv, for example, ends up like this:

00:00 222
01:00 411
04:00 111
05:00 921
10:00 133

In reality, there are thousands of unique names, not just two. I use union and sort functions to add rows in order, but I have not been successful with partitionBy, foreach, or coalesce in getting the rows to their proper CSV files.
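For reference, a minimal sketch of my union-and-sort step for a single name (the paths and the inlined sample data are placeholders standing in for my real inputs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._

//the data set above, inlined for illustration
val df = Seq(
  ("fred", "04:00", "111"), ("greg", "03:00", "123"),
  ("fred", "01:00", "411"), ("fred", "05:00", "921"),
  ("fred", "11:00", "157"), ("greg", "12:00", "333")
).toDF("name", "time", "val")

//existing rows from fred.csv (space-separated time and val, no header)
val existing = spark.read
  .option("sep", " ")
  .csv("folder/fred.csv")
  .toDF("time", "val")

//merge fred's new rows and sort; zero-padded HH:MM sorts correctly as text
val merged = existing
  .union(df.filter($"name" === "fred").select("time", "val"))
  .sort("time")

This part works; what I cannot get right is writing each merged result back to its own name.csv efficiently for thousands of names.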


1 Answer


Import and declare the necessary variables:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder
  .master("local")
  .appName("Partition Sort Demo")
  .getOrCreate()

import spark.implicits._

Create a dataframe from the source file:

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("csv/file/location")

//df.show()
+----+-----+---+
|name| time|val|
+----+-----+---+
|fred|04:00|111|
|greg|03:00|123|
|fred|01:00|411|
|fred|05:00|921|
|fred|11:00|157|
|greg|12:00|333|
+----+-----+---+

Now repartition the dataframe by name, sort each partition, and save the results:

//repartition so that all rows for a given name are colocated
val repartitionedDf = df.repartition($"name")

for {
  //fetch the distinct names in the dataframe to use as file names
  distinctName <- df.select("name").distinct.collect.map(_.getString(0))
} yield {
  repartitionedDf.select("time", "val")
    .filter($"name" === lit(distinctName)) //keep only this name's rows
    .coalesce(1)                           //one partition => one output file
    .sortWithinPartitions($"time")         //sort by time
    .write.mode("overwrite").csv("location/" + distinctName + ".csv") //save
}

Note:

Each output path ("location/" + distinctName + ".csv") is actually a folder; Spark writes the sorted rows into a part-* file inside it (see the comments below on renaming those files).
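As an aside, if per-name folders are acceptable instead of exact name.csv file names, a single write with partitionBy avoids the per-name loop entirely. A sketch, assuming the same df as above:

//one pass: Spark writes one sub-folder per name, e.g. location/name=fred/
df.repartition($"name")
  .sortWithinPartitions($"name", $"time") //keep each name's rows time-ordered
  .write
  .partitionBy("name")
  .mode("overwrite")
  .csv("location")

With partitionBy, the name column is dropped from the file contents and encoded in the folder names, so the part files hold only time and val.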

  • @GregClinton: regarding the 4th comment: in Spark we can't save the output as single named files; alternatively, we rename/merge the files inside the folder by running another job on the output files. Here we rename the part-* files to the same name as the folder, but in another location. – mrsrinivas Mar 16 '17 at 16:56
  • @GregClinton: regarding the 1st comment: if we want to complete it all in one go, we can save to one path (which is, of course, also a folder) by running `save` on `repartitionedDf` – mrsrinivas Mar 16 '17 at 17:02
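For illustration, a rough sketch of the rename step described in the first comment, using the Hadoop FileSystem API (the "location" and "renamed" paths are placeholders; error handling omitted):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val renamed = new Path("renamed")
fs.mkdirs(renamed)

//for each output folder (e.g. location/fred.csv), move its part-* file
//to renamed/<folder name>, i.e. renamed/fred.csv
for (folder <- fs.listStatus(new Path("location")) if folder.isDirectory) {
  fs.listStatus(folder.getPath)
    .find(_.getPath.getName.startsWith("part-"))
    .foreach(p => fs.rename(p.getPath, new Path(renamed, folder.getPath.getName)))
}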