
I think there may be a simple solution to this. Does anybody know how to iterate over a set of files and produce an output based on each file's name?

My problem is that I want to read in a set of graph edges for each month and then create a separate graph for each month.

Currently I've done this the long way, which is fine for one year's worth of data, but I'd like a way to automate it.

You can see my code below, which hopefully shows clearly what I am doing.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

//Load vertex data
val vertices = sc.textFile("D:~vertices.csv")
  .map(line => line.split(","))
  .map(parts => (parts.head.toLong, parts.tail))

//Define function for creating edges from csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
  file.flatMap { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split(",")
      if (lineArray.length < 3) {
        //Skip rows that lack a source, destination, or ID
        None
      } else {
        val srcId = lineArray(0).toLong
        val dstId = lineArray(1).toLong
        val ID = lineArray(2)
        Some(Edge(srcId, dstId, ID))
      }
    } else {
      //Skip blank lines and comment lines
      None
    }
  }
}

//Make graphs. This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)

Any help or pointers on this would be much appreciated.

ALs

1 Answer


You can use SparkContext's wholeTextFiles to get each file's name alongside its content, and then use that string to name, select, or filter your values and outputs.

   val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }

SparkContext's textFile only reads the data; it does not keep the file name.
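Since wholeTextFiles hands back each file's path, a label for the month can be derived from it. Here is a minimal sketch, assuming the files follow the edges2011Jan.txt naming pattern from the question; `EdgeFile` and `monthKey` are hypothetical names for illustration, not part of Spark:

```scala
// Hypothetical pattern: matches any path ending in edges<year><month>.txt,
// e.g. "file:/D:/data/edges2011Jan.txt".
val EdgeFile = """.*edges(\d{4})([A-Za-z]{3})\.txt""".r

// Hypothetical helper: returns (year, month) for a matching path, None otherwise.
def monthKey(path: String): Option[(Int, String)] = path match {
  case EdgeFile(year, month) => Some((year.toInt, month))
  case _                     => None
}
```

The helper could then be applied inside the map over the (filename, content) pairs to tag each file's content with its month.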

----EDIT----

Sorry, I seem to have misunderstood the question. You can load multiple files using

sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }

The asterisk * should match every file in the path, assuming they are all supported input file types.

This reads all of your files into a single RDD of (filename, content) pairs, one pair per file, so you avoid a separate call for each file (each call would need its own path and filename, which I think is what you want to avoid).

Because each record carries its filename, you can group by the file name and apply your graph function to each group.
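One way this might look in practice is sketched below. A Graph cannot be built inside an RDD transformation, so this sketch collects the (filename, content) pairs to the driver first, which assumes every edge file fits in driver memory. `parseEdge` and `graphsByFile` are hypothetical helpers, and the "src,dst,ID" line layout is taken from the question's code:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper: parse one "src,dst,ID" line.
// Blank lines, '#' comments, and rows with fewer than 3 fields yield None.
def parseEdge(line: String): Option[Edge[String]] = {
  val parts = line.split(",").map(_.trim)
  if (line.trim.isEmpty || line.trim.startsWith("#") || parts.length < 3) None
  else Some(Edge(parts(0).toLong, parts(1).toLong, parts(2)))
}

// Hypothetical helper: build one graph per input file, keyed by the file's path.
def graphsByFile(
    sc: SparkContext,
    vertices: RDD[(VertexId, Array[String])],
    pathPattern: String): Map[String, Graph[Array[String], String]] = {
  sc.wholeTextFiles(pathPattern)
    .collect() // assumption: all files are small enough to hold on the driver
    .map { case (path, content) =>
      val lines = content.split("\\r?\\n").toSeq
      val edges = sc.parallelize(lines.flatMap(line => parseEdge(line)))
      path -> Graph(vertices, edges)
    }
    .toMap
}
```

Each graph can then be looked up by its file path in the resulting Map instead of being held in a separately named val per month. If the files are too large to collect, listing the file names on the driver and calling textFile once per name is an alternative design.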

GameOfThrows