Iterating through files in Scala to create values based on the file names

Question

I think there may be a simple solution to this. Does anybody know how to iterate over a set of files and output a value based on each file's name?

My problem is that I want to read in a set of graph edges for each month and then create a separate graph for each month.

Currently I've done this the long way, which is fine for one year's worth of data, but I'd like a way to automate it.

You can see my code below, which hopefully shows clearly what I am doing.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Load vertex data: the first column is the vertex id, the remaining columns are attributes
val vertices = sc.textFile("D:~vertices.csv")
  .map(line => line.split(","))
  .map(parts => (parts.head.toLong, parts.tail))

// Define function for creating edges from a csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
  file.flatMap { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split(",")
      if (lineArray.length < 3) {
        None // skip lines that lack a source, destination, or ID column
      } else {
        val srcId = lineArray(0).toLong // GraphX vertex ids are Long
        val dstId = lineArray(1).toLong
        val ID = lineArray(2)
        Some(Edge(srcId, dstId, ID))
      }
    } else {
      None // skip blank lines and '#' comment lines
    }
  }
}

// Make graphs. This is where I want automation, so I can iterate through a
// folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)

Any help or pointers on this would be much appreciated.

GameOfThrows · Answer 1 · 2016-02-05T11:18:10.807

You can use SparkContext's wholeTextFiles to get each file's name alongside its content, and then use that string to name, select, or filter your values and output.

   val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }

SparkContext's textFile only reads the data; it does not keep the file name.
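As far as I know, the key that wholeTextFiles returns is the full path of each file, so you still need to pull the month out of it. Here is a minimal sketch of that step in plain Scala. The object name FileNameKey and the pattern are my own assumptions, based on the edges2011Jan.txt naming in the question; adjust the regex if your files are named differently.

```scala
object FileNameKey {
  // Hypothetical pattern: matches names such as "edges2011Jan.txt" at the end of a path.
  private val EdgeFile = """edges(\d{4})([A-Za-z]{3})\.txt$""".r

  /** Returns (year, month) parsed from the end of a path, or None if the name does not match. */
  def yearMonth(path: String): Option[(Int, String)] =
    EdgeFile.findFirstMatchIn(path).map(m => (m.group(1).toInt, m.group(2)))
}
```

Inside the map over (filename, content) pairs you could then call FileNameKey.yearMonth(filename) and skip any file that yields None.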

----EDIT----

Sorry, I seem to have misunderstood the question. You can load multiple files using

sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }

The asterisk * should load all files in the path, assuming they are all supported input file types.

This read combines all your files into a single large RDD, which avoids calling the loader once per file (each call needs the path and file name, which I think is what you want to avoid).

Reading with the file name lets you group by file name and apply your graph function to each group.
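One caveat: Spark does not let you create an RDD (and so a Graph) inside a transformation of another RDD, so the per-group graph construction has to happen on the driver. A simple alternative is to loop over the month names there and build one graph per file. The sketch below is plain Scala and only handles the naming and the loop; MonthlyGraphs, edgePath, perMonth, and the directory "D:/data" are hypothetical names of mine, and the Spark call in the trailing comment assumes the question's vertices, EdgeMaker, and sc are in scope.

```scala
object MonthlyGraphs {
  val Months: Seq[String] =
    Seq("Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

  // Assumed naming scheme, taken from the question: <dir>/edges<year><Mon>.txt
  def edgePath(dir: String, year: Int, month: String): String =
    s"$dir/edges$year$month.txt"

  /** Calls `load` once per month on the driver and keys each result by month name. */
  def perMonth[G](dir: String, year: Int)(load: String => G): Map[String, G] =
    Months.map(m => m -> load(edgePath(dir, year, m))).toMap
}

// Hypothetical Spark usage, with the question's definitions in scope:
// val graphs = MonthlyGraphs.perMonth("D:/data", 2011) { path =>
//   Graph(vertices, EdgeMaker(sc.textFile(path)))
// }
// graphs("Jan") then plays the role of graphJan.
```

Keeping the loader as a function parameter means the same loop works for another year or another file format; only the function passed to perMonth changes.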
