Spark job freezes when started in ParArray

Question

I want to convert a set of time-series data from multiple CSV files to LabeledPoint and save the result to Parquet files. The CSV files are small, usually < 10 MiB.

When I run it with a ParArray, it submits 4 jobs at a time and then freezes. Code here:

val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
  .filter(_.getName.contains(".csv"))
  .zipWithIndex
  .par // comment this line out and the code runs smoothly
  .foreach { f =>
    val stk = spark_csv(f._1.getPath) // works fine
    ColMerge(stk, idx, RESULT_PATH(f)) // freezes here
    stk.unpersist()
  }

And the part that freezes:

def ColMerge(ori:DataFrame,index:DataFrame,PATH:String) = {
val df = ori.join(index,ori("date")===index("index_date")).drop("index_date").orderBy("date").cache
val head = df.head
val col = df.columns.filter(e=>e!="code"&&e!="date"&&e!="name")
val toMap = col.filter{
  e=>head.get(head.fieldIndex(e)).isInstanceOf[String]
}.sorted
val toCast = col.diff(toMap).filterNot(_=="data")
val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map{
  row=>
    val res1= toCast.map{
      col=>
        row.getDouble(row.fieldIndex(col))
    }
    val res2= toMap.flatMap{
      col=>
        val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
        row.getString(row.fieldIndex(col)).split("；").par.foreach{
          word=>
            mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
        }
        mapping
    }

    (
      row.getString(row.fieldIndex("code")),
      row.getString(row.fieldIndex("date")),
      res1++res2++row.getAs[Seq[Double]]("data")
      )
}.zipWithIndex.collect
df.unpersist
val dataset = GlobalConfig.sctx.makeRDD(res.map{
  day=>
    (day._1._1,
      day._1._2,
      try{
        new LabeledPoint(GetHighPrice(res(day._2.toInt+2)._1._3.slice(0,4))/GetLowPrice(res(day._2.toInt)._1._3.slice(0,4))*1.03,Vectors.dense(day._1._3))
      }
      catch {
        case ex:ArrayIndexOutOfBoundsException=>
          new LabeledPoint(-1,Vectors.dense(day._1._3))
      }
      )
}).filter(_._3.label != -1).toDF("code","date","labeledpoint")
dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
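For readers following the labeling step: each day's label is computed from the row two positions ahead, and the try/catch handles the last rows where no such row exists. A minimal sketch of the same look-ahead using bounds-checked access instead of an exception is below. The object name and the `high` and `low` parameters are hypothetical stand-ins for GetHighPrice and GetLowPrice, whose definitions are not shown in the question.

```scala
object LookAheadSketch {
  // rows: per-day feature arrays, already sorted by date.
  // high / low: hypothetical stand-ins for GetHighPrice / GetLowPrice.
  def labels(rows: IndexedSeq[Array[Double]],
             high: Array[Double] => Double,
             low: Array[Double] => Double): IndexedSeq[Option[Double]] =
    rows.indices.map { i =>
      // lift returns None when i + 2 is past the end, so no exception is needed
      rows.lift(i + 2).map { ahead =>
        high(ahead.slice(0, 4)) / low(rows(i).slice(0, 4)) * 1.03
      }
    }
}
```

Rows whose label is None correspond to the ones the original code marks with -1 and filters out.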

The exact job that freezes is either the DataFrame.sort() or the zipWithIndex step that produces res in ColMerge.

Since most of the work is done after the collect, I really want to use ParArray to speed up ColMerge, but this weird freeze is stopping me from doing so. Do I need to create a new thread pool for this?
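To make the thread-pool idea concrete, here is a minimal sketch of running the per-file work on a dedicated fixed-size pool with Futures instead of `.par`. It is not a confirmed fix for the freeze. `processFile` is a hypothetical stand-in for the spark_csv + ColMerge pair, and the object name, pool size and timeout are assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object DedicatedPoolSketch {
  // Hypothetical stand-in for spark_csv(path) followed by ColMerge(...).
  def processFile(path: String, idx: Int): String = s"$idx:$path"

  def runAll(paths: Seq[String], threads: Int): Seq[String] = {
    // A pool owned by this call, separate from the default pool used by .par
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = paths.zipWithIndex.map { case (p, i) =>
        Future(processFile(p, i))
      }
      // Wait for every file; results keep the input order
      Await.result(Future.sequence(futures), 1.minute)
    } finally {
      pool.shutdown()
    }
  }
}
```

With this shape, any nested `.par` calls inside the per-file work no longer share a pool with the outer loop, which makes it easier to see whether pool starvation is involved.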

Why is every thread writing to the same path? Is this desired? — Pankaj Arora, Feb 21 '16 at 04:27
@PankajArora PATH is generated foreach csv file. I use RESULT_PATH because the original expr is too long — skywalkerytx, Feb 21 '16 at 05:04
