
I have 3 files in HDFS and would like to find the most efficient way to sort them, first on the 1st column and then on the 2nd column, and store the sorted result back to a new file on HDFS using Scala (or Python) in Spark 1.4.1:
hdfs:///test/2016/file.csv
hdfs:///test/2015/file.csv
hdfs:///test/2014/file.csv

Files look like this (no header):
hdfs:///test/2016/file.csv
127,56,abc
125,56,abc
121,56,abc

hdfs:///test/2015/file.csv
126,66,abc
122,56,abc
123,46,abc

hdfs:///test/2014/file.csv
122,66,abc
128,56,abc
123,16,abc

Sorted output I want to save to HDFS:
hdfs:///test/output/file.csv
121,56,abc
122,56,abc
122,66,abc
123,16,abc
123,46,abc
125,56,abc
126,66,abc
127,56,abc
128,56,abc

I am very new to Spark and so far I only know how to load a file:
val textFile = sc.textFile("hdfs:///test/2016/file.csv")

I tried to read up on how to sort, but it is not clear which libraries work for this case (CSV files) and this version of Spark (1.4.1), or how to use them. Please help, Joe

Joe

2 Answers


I'd suggest using the Databricks spark-csv library for reading and writing CSVs: https://github.com/databricks/spark-csv

As I don't have access to HDFS right now, this example uses the local filesystem, but it should also work with an HDFS path.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions._  // optional: only needed if you order by column expressions, e.g. desc("C0")

object StackoverflowTest {

  def main(args: Array[String]) {
    // basic spark init
    val conf = new SparkConf().setAppName("Data Import from CSV")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // first we load every file from the data directory that starts with 'file'
    val storeDf = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("inferSchema", "true")
      .load("data/file*")

    // then we sort it and write to an output
    storeDf
      .orderBy("C0", "C1")  // default column names
      .repartition(1)   // in order to have 1 output file
      .write
      .format("com.databricks.spark.csv")
      .save("data/output")
  }

}

The result will be written to data/output/part-00000 as csv. Hope this helps.
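
For completeness, here is a minimal sketch of the same read/sort/write as it would look from the spark-shell (this adaptation is mine, following the comments below, not part of the original answer). It assumes the shell was started with the spark-csv package on the classpath, e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 for a Scala 2.10 build of Spark, and it reuses the sc and sqlContext the shell already provides; the HDFS glob and output path are only illustrations based on the paths in the question:

// inside spark-shell: sc and sqlContext already exist, do not create new ones
val storeDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .load("hdfs:///test/*/file.csv")   // picks up the 2014/2015/2016 files

storeDf
  .orderBy("C0", "C1")               // 1st column, then 2nd column
  .repartition(1)                    // single output file
  .write
  .format("com.databricks.spark.csv")
  .save("hdfs:///test/output")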

Daniel B.
  • Is this for Spark version 1.4.1? I am using the console, and after the line val sc = new SparkContext(conf) I see the error: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at: org.apache.spark.SparkContext.(SparkContext.scala:81) org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) – Joe Mar 11 '16 at 21:16
  • That's because you already created a SparkContext in that shell - either close this shell session and start another one, or just skip the context creation and use the one you've already created. – Tzach Zohar Mar 11 '16 at 21:41
  • Yes, as @TzachZohar already noted, if you are running this in the shell, you don't need the val conf, val sc, val sqlContext lines; they are already initialized by default – Daniel B. Mar 11 '16 at 22:10
  • After the .load("correct hdfs path") line, this is the error: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114) – Joe Mar 12 '16 at 01:01
  • That's because it needs the Databricks csv library, as the link I posted mentions. If you are using it with the spark shell, you need to start the shell with spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 or spark-shell --packages com.databricks:spark-csv_2.10:1.4.0, depending on whether you are using Scala 2.11 or 2.10 – Daniel B. Mar 12 '16 at 09:37
  • Tried both --packages variants and it did not work for me, with the error: UNRESOLVED DEPENDENCIES :: com.databricks#spark-csv_2.10;1.4.0: not found – Joe Mar 14 '16 at 18:41
sc.textFile("hdfs:///test/*/*.csv")
  .map(_.split(",", -1) match { case Array(col1, col2, col3) => (col1, col2, col3) })
  .sortBy(t => (t._1, t._2))                 // sort on the 1st column, then the 2nd (as strings)
  .map(t => t._1 + "," + t._2 + "," + t._3)  // reassemble each row as a CSV line
  .saveAsTextFile("hdfs:///testoutput/output/file.csv")

You will want to save to a different folder, otherwise the files you produce will be picked up as input the next time you run it.
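
One caveat worth noting (an addition of mine, not part of the answer): sortBy on string keys orders lexicographically, so "9" would sort after "10". If the first two columns are always integers, as the sample data suggests, a variant that sorts them numerically could look like this; the output path is only illustrative:

sc.textFile("hdfs:///test/*/*.csv")
  .map(_.split(",", -1) match { case Array(col1, col2, col3) => (col1, col2, col3) })
  .sortBy(t => (t._1.toInt, t._2.toInt))     // numeric order on the first two columns (assumes they parse as Int)
  .map(t => t._1 + "," + t._2 + "," + t._3)
  .saveAsTextFile("hdfs:///testoutput/output_numeric")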

Radu Ionescu
  • Thanks for sharing. I am using the console and trying this.. Do I need to import any library before attempting this approach? How do you test whether the sorting phase worked? (for me it shows empty / no result after the 3rd line, .sortBy(_._1), when I do textFile.foreach(println)) – Joe Mar 11 '16 at 21:11
  • @Joe I had a typo. It should be `sc.textFile("hdfs:///test/*/*.csv")`. Did you correct for this? You do not need any libraries – Radu Ionescu Mar 11 '16 at 21:27
  • I saw the small typo and fixed it, but it still does not work for me.. Here is from my console: scala> .map(_.1+","+_._2+","+._3) :1: error: ')' expected but double literal found. res9.map(_.1+","+_._2+","+._3) – Joe Mar 11 '16 at 21:39
  • `res9.map(_._1+","+_._2+","+_._3)`. – Radu Ionescu Mar 11 '16 at 21:41
  • Can you run this `val textFile = sc.textFile("hdfs:///test/*/*.csv").count()` and tell me the results? – Radu Ionescu Mar 11 '16 at 21:43
  • Loading the file(s) and the path works fine. Output is: textFile: Long = 10 – Joe Mar 11 '16 at 21:48
  • Seems like there is an empty array/object created after the 3rd line: .sortBy(_._1) – Joe Mar 11 '16 at 21:49
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106056/discussion-between-radu-ionescu-and-joe). – Radu Ionescu Mar 11 '16 at 21:53
  • One update - I tested on a CSV file with over 22 fields and it starts complaining/failing due to a hard limit of 22 items.. – Joe Mar 14 '16 at 18:36
  • This is expected, since [tuples are limited to 22](http://stackoverflow.com/questions/4152223/why-are-scala-functions-limited-to-22-parameters). If you still want to sort based on the first column, use `x => x.split(",",-1) match { case Array(col1, rescols @ _*) => (col1, x)}` and, before saving, `map(_._2)` (see the sketch below these comments) – Radu Ionescu Mar 14 '16 at 19:37
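
To make the last comment concrete, here is a minimal sketch of that idea: keep the whole line as the value and use only the leading columns as the sort key, so the 22-element tuple limit no longer applies. Keying on the first two columns (to match the question's requested sort order) and the output path are my assumptions, not part of the original comment:

sc.textFile("hdfs:///test/*/*.csv")
  .map(line => line.split(",", -1) match {
    case Array(col1, col2, _*) => ((col1, col2), line)   // sort key + original line
  })
  .sortBy(_._1)   // order by (1st, 2nd) column
  .map(_._2)      // drop the key, keep the original CSV line
  .saveAsTextFile("hdfs:///testoutput/output_wide")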