
I have a flat file on HDFS containing a list of companies:

CompanyA
CompanyA Description
April '12
San Fran
11-50
CompanyB
...

and I want to map this into a Company case class:

case class Company(company: String, 
                   desc: String, 
                   founded: Date, 
                   location: String, 
                   employees: String)

I have tried the following, but it doesn't seem to map properly:

val companiesText = sc.textFile(...)

val companies = companiesText.map(
   lines => Company(
        lines(0).toString.replaceAll("\"", ""),
        lines(1).toString.replaceAll("\"", ""),
        lines(2).toString.replaceAll("\"", ""),
        lines(3).toString.replaceAll("\"", ""),
        lines(4).toString.replaceAll("\"", ""),
        lines(5).toString.replaceAll("\"", "")
    )
)

I know I am not handling the date properly here, but that is not the issue.
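(For reference, a string like "April '12" can be parsed into a `java.util.Date` with `SimpleDateFormat`; the pattern below is an assumption inferred from the sample data, where two single quotes escape the literal apostrophe:)

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// Pattern is an assumption based on the sample line "April '12":
// full month name, a space, a literal apostrophe ('' escapes it), two-digit year.
val fmt = new SimpleDateFormat("MMMM ''yy", Locale.ENGLISH)
val founded: Date = fmt.parse("April '12")
```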

  • What is the error message? – Yuval Itzchakov Apr 06 '16 at 12:58
  • If I do a companies.count I get org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 67, sandbox.hortonworks.com): java.lang.StringIndexOutOfBoundsException: String index out of range: 5 – Eoin Lane Apr 06 '16 at 13:01
  • Unless you tell it somehow, Spark has no idea your file has one record per 5 lines. So in your map, each element is one line. So lines(0) is the first *character* of your string, not the first line. Eventually, you get a line with less than 6 characters, and so a StringIndexOutOfBoundsException – The Archetypal Paul Apr 06 '16 at 13:23
  • For another thing, your `case class` takes only 5 parameters, you are passing in 6. That won't work either. Not that that's your problem. – David Griffin Apr 06 '16 at 13:23
  • You need `sc.hadoopFile` and the n-line input format: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html – The Archetypal Paul Apr 06 '16 at 13:31
  • Details here: http://stackoverflow.com/questions/36052480/how-to-read-multiple-line-elements-in-spark – The Archetypal Paul Apr 06 '16 at 13:34
  • Mine was more fun to write. – David Griffin Apr 06 '16 at 13:36
  • Can't disagree with that :) – The Archetypal Paul Apr 06 '16 at 13:38
  • David, do you mind posting your answer again, please? I don't need this to run quickly. I am sure the newAPIHadoopFile example is better, but it seems complex to me. – Eoin Lane Apr 06 '16 at 13:58
  • I voted to undelete, but really, the `hadoopFile` isn't complex. All you need is in the answer in the linked-to question. – The Archetypal Paul Apr 06 '16 at 16:26
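As the comments explain, `sc.textFile` produces one element per *line*, so indexing a line gives characters, not fields. A minimal sketch of the grouping approach discussed above, assuming exactly five lines per record and preserving line order via `zipWithIndex`; the HDFS path and the date pattern are placeholders, not from the original post:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

case class Company(company: String, desc: String, founded: Date,
                   location: String, employees: String)

// Each element of the RDD is one line; pair every line with its index
// and bucket each run of five consecutive lines into one record.
val companies = sc.textFile("hdfs:///path/to/companies.txt")  // placeholder path
  .zipWithIndex()
  .map { case (line, idx) => (idx / 5, (idx % 5, line)) }     // 5 lines per record
  .groupByKey()
  .map { case (_, numbered) =>
    // restore within-record order, then strip quotes as in the question
    val f = numbered.toSeq.sortBy(_._1).map(_._2.replaceAll("\"", ""))
    val fmt = new SimpleDateFormat("MMMM ''yy", Locale.ENGLISH) // assumed pattern for "April '12"
    Company(f(0), f(1), fmt.parse(f(2)), f(3), f(4))
  }
```

This incurs a shuffle from `groupByKey`; for large files the `hadoopFile`/NLineInputFormat route linked in the comments avoids it.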

0 Answers