I have data in an S3 bucket under the directory /data/vw/. Each line is of the form:

| abc:2 def:1 ghi:3 ...

I want to convert it to the following format:

abc abc def ghi ghi ghi

The converted lines should be written to S3 under the directory /data/spark.

Basically, each token should be repeated the number of times given after its colon. I am trying to convert a VW LDA input file into the corresponding format expected by Spark's LDA library.

The code:

import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}

When I run this as a Spark job on AWS EMR, the step fails with the following exception:

18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
    at ...

The code is run as:

spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark

I have tried specifying new output paths (/data/spark1, etc.) and making sure they do not exist before the step runs, but the step still fails.

What am I doing wrong? I am new to Scala and Spark, so I might be overlooking something here.

Nik
  • BTW, the job does create the directory `/data/spark` in S3 (and I also see a `_temporary` in there), but the job fails a few moments after that. – Nik Jan 20 '18 at 00:47

1 Answer


You could convert the RDD to a DataFrame and then save it with overwrite mode enabled.

coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
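
Note that toDF on an RDD needs a SparkSession's implicits in scope, so the SparkContext-only setup in the question is not quite enough on its own. Here is a minimal sketch of the whole job using the DataFrame writer; the SparkSession setup and the choice of text() output (each row is a single string column) are my assumptions, not part of the original code:

import org.apache.spark.sql.SparkSession

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder().appName("FormatConverter").getOrCreate()
    import spark.implicits._ // needed for .toDF on an RDD[String]

    val vwdata = spark.sparkContext.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    // mode("overwrite") replaces the output directory instead of failing when it already exists
    sparkdata.coalesce(100).toDF.write.mode("overwrite").text(outputPath)

    spark.stop()
  }
}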

Or, if you insist on using RDD methods, you can do as described in this answer.
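
For reference, the RDD-side workaround usually amounts to deleting the existing output directory through the Hadoop FileSystem API before calling saveAsTextFile. A sketch, assuming sc and coalescedSparkData from the question are in scope:

import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop OutputFormat refuses to write into an existing directory,
// so remove it (recursively) before saving.
val fs = FileSystem.get(new URI(outputPath), sc.hadoopConfiguration)
val out = new Path(outputPath)
if (fs.exists(out)) {
  fs.delete(out, true)
}

coalescedSparkData.saveAsTextFile(outputPath)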

Gsquare
  • But why is it failing in the first place? – Nik Jan 21 '18 at 18:47
  • The above snippet worked, although since I still don't know the issue with the original code, I am not marking this as a solution (but upvoted your answer). Thanks. – Nik Jan 24 '18 at 00:26